Abstract

We consider receiver design for coded transmission over linear Gaussian channels. We 
restrict ourselves to the class of lattice codes and formulate the joint detection and decoding 
problem as a closest lattice point search (CLPS). Here, a tree search framework for solving the 
CLPS is adopted. In our framework, the CLPS algorithm decomposes into the preprocessing 
and tree search stages. The role of the preprocessing stage is to expose the tree structure in 
a form matched to the search stage. We argue that the minimum mean square error decision 
feedback (MMSE-DFE) frontend is instrumental for solving the joint detection and decoding 
problem in a single search stage. It is further shown that MMSE-DFE filtering allows lattice
reduction methods to be used to reduce complexity, at the expense of a marginal performance
loss, and enables the solution of under-determined linear systems. For the search stage, we present a generic
method, based on the branch and bound (BB) algorithm, and show that it encompasses all 
existing sphere decoders as special cases. The proposed generic algorithm further allows for an 
interesting classification of tree search decoders, sheds more light on the structural properties 
of all known sphere decoders, and inspires the design of more efficient decoders. In particular, 
an efficient decoding algorithm that resembles the well known Fano sequential decoder is 
identified. The excellent performance-complexity tradeoff achieved by the proposed MMSE- 
Fano decoder is established via simulation results and analytical arguments in several MIMO 
and ISI scenarios. 



1 Introduction 

Recent years have witnessed a growing interest in the closest lattice point search (CLPS) problem. 
This interest was primarily sparked by the connection between CLPS and maximum likelihood 
(ML) decoding in multiple-input multiple-output (MIMO) channels [2]. On the positive side,
MIMO channels offer significant advantages in terms of increased throughput and reliability. The 
price entailed by these gains, however, is a more challenging decoding task for the receiver. For 
example, naive implementations of the ML decoder have complexity that grows exponentially with 
the number of transmit antennas. This observation inspired several approaches to sub-optimal
decoding that offer different performance-complexity tradeoffs (e.g., [HI [26]).

Reduced complexity decoders are typically obtained by exploiting the codebook structure. The 
scenario considered in our work is no exception. In principle, the decoders considered here exploit 
the underlying lattice structure of the received signal to cast the decoding problem as a CLPS. 
Some variants of such decoders are known in the literature as sphere decoders (e.g., [HI EH EJ
USUI!!)]). These decoders typically exploit number-theoretic ideas to efficiently span the space of
allowed codewords (e.g., B=2J). The complexity of such decoders was shown, via simulation
and numerical analysis, to be significantly smaller than that of the naive ML decoder in many scenarios of
practical interest (e.g., [EHM]). The complexity of state-of-the-art sphere decoders, however,
remains prohibitive for problems characterized by a large dimensionality. This observation is
one of the main motivations for our work.

The overriding goal of our work is to establish a general framework for the design and analysis
of tree search algorithms for joint detection and decoding. Towards this goal, we first divide
the decoding task into two interrelated stages; namely, 1) preprocessing and 2) tree search. The 
preprocessing stage is primarily concerned with exposing the underlying tree structure from the 
noisy received signal. Here, we discuss the integral roles of minimum mean square error decision 
feedback (MMSE-DFE) filtering, lattice reduction techniques, and relaxing the boundary control 
(i.e., lattice decoding) in tree search decoding. We then proceed to the search stage where a general 
framework based on the branch and bound (BB) algorithm is presented. This framework rigorously
establishes the equivalence, in terms of performance and complexity, between different sphere
and sequential decoders. We further use the proposed framework to classify the different search
algorithms and identify their advantages and disadvantages. The MMSE-Fano decoder emerges as a
special case of our general framework that enjoys a favorable performance-complexity tradeoff. We 
establish the superiority of the proposed decoder via numerical results and analytical arguments in 
several relevant scenarios corresponding to coded as well as uncoded transmission over MIMO and 
inter-symbol-interference (ISI) channels. More specifically, in our simulation experiments, we apply 
the tree search decoding framework to uncoded V-BLAST [27], linear dispersion space-time codes
[33], algebraic space-time codes [3TJ EU HOj, and trellis codes over ISI channels [T7j. In all these
cases, our results show that the MMSE-Fano decoder achieves near-ML performance with a much 
smaller complexity. 

The rest of the paper is organized as follows. Section 2 introduces our system model and notation.
In Section 3 we consider the design of the preprocessing stage and discuss the interplay between
this stage and the tree search stage. In Section 4 we present a general framework for designing
tree search decoders based on the branch and bound (BB) algorithm. In Section 5 we establish
the superior performance-complexity tradeoff achieved by the proposed MMSE-Fano decoder, using
analytical arguments and numerical results, in several interesting scenarios. Finally, we offer some
concluding remarks in Section 6.

2 System Model 

We consider the transmission of lattice codes over linear channels with additive white Gaussian
noise (AWGN). The importance of this problem stems from the fact that several relevant
applications in digital communications fall into this class, as illustrated by the
examples at the end of this section. Let $\Lambda \subset \mathbb{R}^m$ be an $m$-dimensional lattice, i.e., the set of points

$$\Lambda = \{\lambda = Gx : x \in \mathbb{Z}^m\} \tag{1}$$

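where $G \in \mathbb{R}^{m\times m}$ is the lattice generator matrix. Let $v \in \mathbb{R}^m$ be a vector and $\mathcal{R}$ a measurable
region in $\mathbb{R}^m$. A lattice code $\mathcal{C}(\Lambda, v, \mathcal{R})$ is defined j2H I2SI UE] as the set of points of the lattice
translate $\Lambda + v$ inside the shaping region $\mathcal{R}$, i.e.,

$$\mathcal{C}(\Lambda, v, \mathcal{R}) = \{\Lambda + v\} \cap \mathcal{R}. \tag{2}$$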

Without loss of generality, we can also view $\mathcal{C}(\Lambda, v, \mathcal{R})$ as the set of points $c + v$, where the
codewords $c$ are given by

$$c = Gx, \quad x \in \mathcal{U} \tag{3}$$

where $\mathcal{U} \subset \mathbb{Z}^m$ is the code information set.

The linear additive noise channel is described, in general, by the input-output relation

$$r = H(c + v) + z \tag{4}$$

where $r \in \mathbb{R}^n$ denotes the received signal vector, $z \sim \mathcal{N}(0, I)$ is the AWGN vector, and $H \in \mathbb{R}^{n\times m}$
is a matrix that defines the linear mapping of the channel between the input and the output.

Consider the following communication problem: a vector of information symbols $x$ is generated
with uniform probability over $\mathcal{U}$, the corresponding codeword $c = Gx$ is produced by the encoder,
and the signal $c + v$ is transmitted over the channel (4). Assuming that $H$ and $v$ are known to the receiver,
the ML decoding rule is given by

$$\hat{x} = \arg\min_{x \in \mathcal{U}} \| r - Hv - HGx \|^2. \tag{5}$$

The constraint $\mathcal{U} \subset \mathbb{Z}^m$ implies that the optimization problem in (5) can be viewed as a constrained
version of the CLPS with lattice generator matrix $HG$ and constraint set $\mathcal{U}$.
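As a baseline, the ML rule (5) can be evaluated by exhaustive search when the dimension is small, which makes the exponential cost explicit. The Python sketch below assumes a hypercube information set $\mathcal{U} = \mathbb{Z}_Q^m$ and matrices stored as plain lists of rows; the function `ml_decode_bruteforce` and the numbers in the example are hypothetical.

```python
from itertools import product


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def ml_decode_bruteforce(r, H, G, v, Q):
    """Exhaustive evaluation of the ML rule (5) for U = Z_Q^m.

    Returns (x_hat, d), where x_hat minimizes |r - Hv - HGx|^2 over
    {0, ..., Q-1}^m and d is the attained squared distance.
    """
    HG = matmul(H, G)
    target = [ri - hv for ri, hv in zip(r, matvec(H, v))]   # r - Hv
    best_x, best_d = None, float("inf")
    for x in product(range(Q), repeat=len(G[0])):           # Q**m candidates
        d = sum((t - s) ** 2 for t, s in zip(target, matvec(HG, x)))
        if d < best_d:
            best_x, best_d = x, d
    return best_x, best_d


# Hypothetical noiseless example: n = 3, m = 2, G = I, centering translate v.
H = [[1.0, 0.4], [0.2, 0.9], [0.1, 0.3]]
G = [[1.0, 0.0], [0.0, 1.0]]
v = [-0.5, -0.5]
r = matvec(H, [1 + v[0], 0 + v[1]])      # transmitted c + v with x = (1, 0)
x_hat, d = ml_decode_bruteforce(r, H, G, v, Q=2)
print(x_hat)  # (1, 0)
```

The loop visits $Q^m$ candidates, which is exactly the exponential growth that motivates the tree search decoders discussed in this paper.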
A few remarkable examples of the above framework are: 

1. MIMO flat fading channels: One of the simplest and most widely studied examples is a
MIMO V-BLAST system with square QAM modulation, $M$ transmit and $N$ receive antennas,
operating over a flat Rayleigh fading channel. The baseband complex received signal¹ in this
case can be expressed as

$$r^c = \sqrt{\frac{\rho}{M}}\, H^c c^c + z^c \tag{6}$$

where the complex channel matrix $H^c \in \mathbb{C}^{N\times M}$ is composed of i.i.d. elements $h^c_{i,j} \sim \mathcal{N}_c(0, 1)$,
the input complex signal $c^c$ has components $c^c_i$ chosen from a unit-energy $Q^2$-QAM constellation,
the noise has i.i.d. components $z^c_i \sim \mathcal{N}_c(0, 1)$, and $\rho$ denotes the signal-to-noise ratio
(SNR) observed at any receive antenna. The system model in (6) can be expressed in the
form of (4) by appropriate scaling and by separating the real and imaginary parts using the
vector and matrix transformations defined by

$$u^c \mapsto u = \left[\mathrm{Re}\{u^c\}^T, \mathrm{Im}\{u^c\}^T\right]^T, \qquad M^c \mapsto M = \begin{bmatrix} \mathrm{Re}\{M^c\} & -\mathrm{Im}\{M^c\} \\ \mathrm{Im}\{M^c\} & \mathrm{Re}\{M^c\} \end{bmatrix}.$$

The resulting real model is given by (4), where $n = 2N$, $m = 2M$, and the constraint set is
given by $\mathcal{U} = \mathbb{Z}_Q^m$, with $\mathbb{Z}_Q = \{0, \ldots, Q-1\}$ denoting the set of integer residues modulo $Q$.

¹ We use the superscript c to denote complex variables.

In the case of V-BLAST, the lattice code generator matrix is $G = \kappa I$, where $\kappa$ is a normalizing
constant, a function of $Q$, that makes the (complex) transmitted signal of unit energy per
symbol. This formulation extends naturally to MIMO channels with more general lattice-coded
inputs [TH]. In general, a space-time code of block length $T$ is defined by a set of
matrices $C^c \in \mathbb{C}^{M\times T}$. The columns $c^c_t$ of the codeword $C^c$ are transmitted in
parallel on the $M$ transmit antennas in $T$ channel uses. The received signal is given by the
sequence of vectors

$$r^c_t = \sqrt{\frac{\rho}{M}}\, H^c c^c_t + z^c_t, \qquad t = 1, \ldots, T. \tag{7}$$

Lattice space-time codes are obtained by taking a lattice code $\mathcal{C}(\Lambda, v, \mathcal{R})$ in $\mathbb{R}^{2MT}$ and mapping
each codeword $c$ into a complex matrix $C^c$ according to some linear one-to-one mapping
$\mathbb{R}^{2MT} \to \mathbb{C}^{M\times T}$. It is easy to see that a lattice-coded MIMO system can again be expressed
by (4), where the channel matrix $H$ is proportional (through an appropriate scaling factor) to
the block-diagonal matrix

$$\begin{bmatrix} \tilde{H} & & \\ & \ddots & \\ & & \tilde{H} \end{bmatrix}, \qquad \tilde{H} = \begin{bmatrix} \mathrm{Re}\{H^c\} & -\mathrm{Im}\{H^c\} \\ \mathrm{Im}\{H^c\} & \mathrm{Re}\{H^c\} \end{bmatrix} \tag{8}$$

with $T$ diagonal blocks. In this case, we have $n = 2NT$ and $m = 2MT$. Interestingly, for a wide
class of linear dispersion (LD) codes jS3 EE UH1 HI ESj, the information set $\mathcal{U}$ is still given
by $\mathbb{Z}_Q^m$, as in the simple V-BLAST case, although the generator matrix $G$ is generally not
proportional to $I$. For other classes of lattice codes [IB], with more involved shaping regions
$\mathcal{R}$, the information set $\mathcal{U}$ does not take on the simple form of a "hypercube". For example,
consider $\Lambda$ obtained by Construction A [TO], i.e., $\Lambda = C + Q\mathbb{Z}^m$, where $C \subset \mathbb{Z}_Q^m$ is a linear code
over $\mathbb{Z}_Q$ with generator matrix in systematic form $[I, P^T]^T$. A generator matrix of $\Lambda$ is given
by CD]

$$G = \begin{bmatrix} I & 0 \\ P & QI \end{bmatrix}. \tag{9}$$

Typical choices for the shaping region $\mathcal{R}$ of the lattice code $\mathcal{C}(\Lambda, v, \mathcal{R})$ are an $m$-dimensional sphere,
the fundamental Voronoi region of a sublattice $\Lambda' \subset \Lambda$, or the $m$-dimensional hypercube. In
all these cases, the information set $\mathcal{U}$ may be difficult to describe.

2. ISI Channels: For simplicity, we consider a baseband real single-input single-output (SISO)
inter-symbol-interference (ISI) channel with input and output sequences related by

$$r_i = \sum_{\ell=0}^{L} h_\ell\, c_{i-\ell} + z_i$$

where $(h_0, \ldots, h_L)$ denotes the discrete-time channel impulse response, assumed of finite length
$L + 1$. The extension to the complex baseband model is immediate. Assuming that the
transmitted signal is padded by $L$ zeros, the channel can be written in the form (4), where
the channel matrix takes on the tall banded Toeplitz form

$$H = \begin{bmatrix} h_0 & & & \\ h_1 & h_0 & & \\ \vdots & h_1 & \ddots & \\ h_L & \vdots & \ddots & h_0 \\ & h_L & & h_1 \\ & & \ddots & \vdots \\ & & & h_L \end{bmatrix}.$$

A wide family of trellis codes obtained as coset codes [211123, including binary linear codes,
can be formulated as lattice codes where $\Lambda$ is a Construction A lattice and the shaping region
$\mathcal{R}$ is chosen appropriately. In particular, coded modulation schemes based on the $Q$-PAM
constellation, obtained by mapping group codes over $\mathbb{Z}_Q$ onto the $Q$-PAM constellation, can be
seen as lattice codes with hypercubic shaping $\mathcal{R}$. The important case of binary convolutional
codes falls in this class for $Q = 2$. Again, the information set $\mathcal{U}$ corresponding to $\mathcal{R}$ may, in
general, be very complicated.
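The real-valued matrices of the examples above are easy to build explicitly. The Python sketch below (hypothetical helper names, matrices as plain lists of rows) implements the complex-to-real mapping, the Construction A generator (9), and the tall banded Toeplitz channel matrix, and checks each against its defining relation on small made-up inputs.

```python
def complex_to_real_vector(uc):
    """u^c -> [Re{u^c}^T, Im{u^c}^T]^T."""
    return [z.real for z in uc] + [z.imag for z in uc]


def complex_to_real_matrix(Mc):
    """M^c -> [[Re{M^c}, -Im{M^c}], [Im{M^c}, Re{M^c}]]."""
    top = [[z.real for z in row] + [-z.imag for z in row] for row in Mc]
    bottom = [[z.imag for z in row] + [z.real for z in row] for row in Mc]
    return top + bottom


def construction_a_generator(P, Q):
    """Generator (9) of Lambda = C + Q Z^m for a Z_Q code with systematic generator [I; P]."""
    k, m = len(P[0]), len(P[0]) + len(P)
    G = [[0] * m for _ in range(m)]
    for i in range(k):
        G[i][i] = 1                     # identity block
    for i, prow in enumerate(P):
        G[k + i][:k] = list(prow)       # parity block P
        G[k + i][k + i] = Q             # Q I block
    return G


def isi_toeplitz(h, m):
    """Tall banded Toeplitz H of size (m + L) x m for h = (h_0, ..., h_L)."""
    L = len(h) - 1
    H = [[0.0] * m for _ in range(m + L)]
    for j in range(m):                  # column j is h shifted down by j
        for l, hl in enumerate(h):
            H[j + l][j] = hl
    return H


def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]
```

For instance, `matvec(isi_toeplitz(h, m), c)` reproduces the full linear convolution of `h` with the zero-padded input `c`, and every lattice point `matvec(G, x)` of the Construction A lattice reduces modulo $Q$ to a codeword of $C$.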



3 The Preprocessing Stage 

In our framework, we divide the CLPS into two stages; namely, 1) preprocessing and 2) tree search. 
The complexity and performance of CLPS algorithms depend critically on the efficiency of the 
preprocessing stage. Loosely, the goal of preprocessing is to transform the original constrained 
CLPS problem, described by the lattice generator matrix HG and by the constraint set U, into a 
form which is friendly to the search algorithm used in the subsequent stage. In the following, we 
discuss the different tasks performed in the preprocessing stage. In general, a friendly tree structure 
can be exposed through three steps: left preprocessing, right preprocessing, and forming the tree. 

Some options for these three steps are illustrated in the following subsections. However, before 
entering the algorithmic details, it is worthwhile to point out some general considerations. The 
classical sphere decoding approach to the solution of the original constrained CLPS problem (5)
consists of applying QR decomposition to the combined channel and code matrix, i.e., letting
$HG = QR$, where $Q \in \mathbb{R}^{n\times m}$ has orthonormal columns and $R \in \mathbb{R}^{m\times m}$ is upper triangular. Since $Q$
preserves Euclidean distances within its column space, and the component of $r - Hv$ orthogonal to that
space contributes the same amount for every $x$, (5) can be written as

$$\hat{x} = \arg\min_{x \in \mathcal{U}} \| y' - Rx \|^2 \tag{10}$$

where $y' = Q^T(r - Hv)$. If $\mathrm{rank}(HG) = m$, $R$ has non-zero diagonal elements and its triangular
form can be exploited to search for all the points $x \in \mathcal{U}$ such that $Rx$ is in a sphere of a given search
radius centered at $y'$. If the sphere is non-empty, the ML solution is guaranteed to be found inside
the sphere; otherwise, the search radius is increased and the search is restarted. Different variations
on this main theme have been proposed in the literature and will be reviewed in Section 4 as special
cases of a general BB algorithm. Nevertheless, it is useful to point out here the two main sources
of inefficiency of the above approach: 1) it does not apply to the case $\mathrm{rank}(HG) < m$ and, even
when $\mathrm{rank}(HG) = m$, if $HG$ is ill-conditioned, the spread (or dynamic range) of the diagonal
elements of $R$ is large. This entails a large tree search complexity jT^j. Intuitively, when $HG$
is ill-conditioned, the lattice generated by $HG$ has a very skewed fundamental cell, such that there
are directions in which it is very difficult to distinguish the points $\{HGx : x \in \mathcal{U}\}$; 2) enforcing
the condition $x \in \mathcal{U}$ can be very difficult, because a lattice code $\mathcal{C}(\Lambda, v, \mathcal{R})$ with a non-trivial shaping
region $\mathcal{R}$ might have an information set $\mathcal{U}$ with a complicated shape. Hence, just checking the
condition $x \in \mathcal{U}$ during the search may entail a significant complexity.
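The equivalence between (5) and (10) can be checked numerically: for every $x$, the two metrics differ by the same constant, namely the energy of $r - Hv$ outside the column space of $HG$. The sketch below assumes a full-column-rank $HG$ and uses modified Gram-Schmidt only for brevity; `qr_mgs`, `metrics`, and the numbers are hypothetical.

```python
import math


def qr_mgs(A):
    """Thin QR of an n x m matrix (list of rows, full column rank) via modified Gram-Schmidt.

    Returns (Qcols, R): m orthonormal columns and an m x m upper triangular R.
    """
    n, m = len(A), len(A[0])
    Qcols, R = [], [[0.0] * m for _ in range(m)]
    for j in range(m):
        v = [float(A[i][j]) for i in range(n)]
        for i, q in enumerate(Qcols):        # project out earlier directions
            R[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            v = [vk - R[i][j] * qk for vk, qk in zip(v, q)]
        R[j][j] = math.sqrt(sum(vk * vk for vk in v))
        Qcols.append([vk / R[j][j] for vk in v])
    return Qcols, R


def metrics(r_prime, HG, x):
    """Return (|r' - HGx|^2, |y' - Rx|^2), with r' = r - Hv and y' = Q^T r'."""
    Qcols, R = qr_mgs(HG)
    y = [sum(qi * ri for qi, ri in zip(q, r_prime)) for q in Qcols]
    full = sum((ri - sum(h * xj for h, xj in zip(row, x))) ** 2
               for ri, row in zip(r_prime, HG))
    tri = sum((yk - sum(R[k][j] * x[j] for j in range(k, len(x)))) ** 2
              for k, yk in enumerate(y))
    return full, tri
```

Since the difference `full - tri` does not depend on `x`, minimizing either metric over $\mathcal{U}$ gives the same $\hat{x}$.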

Left preprocessing can be seen as an effort to tackle the first problem: it modifies the channel
matrix and the noise vector such that the resulting CLPS problem is not equivalent to ML
(therefore, it is suboptimal), but it has a much better conditioned "channel" matrix. The second
problem can be tackled by relaxing the constraint set $\mathcal{U}$ to the whole $\mathbb{Z}^m$, i.e., searching over the
whole lattice $\Lambda$ instead of only the lattice code $\mathcal{C}$ (lattice decoding). In general, lattice decoding
is another source of suboptimality. Nevertheless, once the boundary region is removed, we have
the freedom of choosing the lattice basis that is most convenient for the search algorithm. This
change of lattice basis is accomplished by right preprocessing. Finally, the tree structure is obtained
by factorizing the resulting combined channel-lattice matrix into upper triangular form, as
in classical sphere decoding. Overall, left and right preprocessing combined with lattice decoding
are a way to reduce complexity at the expense of optimality. Fortunately, an appropriate
combination of these elements yields very significant savings in complexity with a very
small degradation with respect to ML performance, and hence a very attractive decoding
solution. While the outstanding performance of appropriate preprocessing and lattice decoding can
be motivated via rigorous information-theoretic arguments fHl El 120], here we are more concerned
with the algorithmic aspects of the decoder and give some heuristic motivation based on
"signal-processing" arguments.

Finally, we note that the notion of complexity adopted in this work does not capture the complexity
of the preprocessing stage (mostly cubic in the lattice dimension). In practice, this assumption
is justified in slowly varying channels where the complexity of the preprocessing stage will be
shared by many transmission frames (e.g., a wired ISI channel or a wireless channel with stationary 
terminals). If the number of these frames is large enough, i.e., the channel is slow enough, the 
preprocessing complexity can be ignored compared to the complexity of the tree search stage which 
has to be independently performed in every frame. Optimizing the complexity of the preprocessing 
stage, however, is an important topic, especially for fast fading channels. 

3.1 Taming the Channel: Left Preprocessing 

In the case of uncoded transmission (G = I), QR decomposition of the channel matrix H (assuming 
rank(H) = m) allows simple recursive detection of the information symbols x. Indeed, Q is the 
feedforward matrix of the zero-forcing decision feedback equalizer (ZF-DFE) [27]. In general, sphere
decoders can be seen as ZF-DFEs with some reprocessing capability of their tentative decisions. 

It is well known that the ZF-DFE is outperformed by the MMSE-DFE in terms of signal-to-interference-plus-noise
ratio (SINR) at the decision point, under the assumption of correct decision feedback [H].
This observation motivates the proposed approach for left preprocessing ^3], in which the MMSE-DFE
filters are obtained through the QR decomposition of the augmented channel matrix

$$\tilde{H} = \begin{bmatrix} H \\ I \end{bmatrix} = \tilde{Q} R_1 \tag{11}$$

where $\tilde{Q} \in \mathbb{R}^{(n+m)\times m}$ has orthonormal columns and $R_1$ is upper triangular. Let $Q_1$ be the upper
$n \times m$ part of $\tilde{Q}$. $Q_1$ and $R_1$ are the MMSE-DFE forward and backward filters, respectively. Thus,
the transformed channel matrix and received signal are given by $R_1$ and $y' = Q_1^T r - R_1 v$,
respectively. The transformed CLPS

$$\min_{x \in \mathcal{U}} \| y' - R_1 G x \|^2 \tag{12}$$

is not equivalent to (5) since, in general, $Q_1$ does not have orthonormal columns. The additive noise
$w = y' - R_1 G x$ in (12) contains both a Gaussian component, given by $Q_1^T z$, and a non-Gaussian
(signal-dependent) component, given by $(Q_1^T H - R_1)(c + v)$. Nevertheless, for lattice codes such
that $\mathrm{cov}(c + v) = I$, it can be shown that $\mathrm{cov}(w) = I$ [18]. Indeed, writing $Q_2$ for the lower $m \times m$ part
of $\tilde{Q}$, (11) gives $H = Q_1 R_1$ and $Q_2 = R_1^{-1}$, so that $Q_1^T H - R_1 = -R_1^{-T}$ and, for $z$ independent of $c + v$,
$\mathrm{cov}(w) = R_1^{-T} R_1^{-1} + Q_1^T Q_1 = \tilde{Q}^T \tilde{Q} = I$. Hence, the additive noise $w$
in (12) is still white, although non-Gaussian and data dependent. Therefore, the minimum distance
rule (12) is expected to be only slightly suboptimal.² On the other hand, the augmented channel
matrix $\tilde{H}$ in (11) always has rank equal to $m$ and is well conditioned, since $R_1^T R_1 = I + H^T H$.
Therefore, in some sense we have tamed the channel at the (small) price of the non-Gaussianity of
the noise. The better conditioning achieved by MMSE-DFE preprocessing is illustrated in Fig. 2(b)
and (c).
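The identities behind (11) and (12) can be verified numerically. The Python sketch below (hypothetical helper names; modified Gram-Schmidt used only for brevity) factors the augmented matrix for an under-determined $H$ ($n < m$), checks that $R_1^T R_1 = I + H^T H$ and $Q_1^T Q_1 + R_1^{-T} R_1^{-1} = I$, and, for $G = I$ and a noiseless channel, that the residual $y' - R_1 c$ equals $-R_1^{-T}(c + v)$.

```python
import math


def qr_mgs(A):
    """Thin QR (modified Gram-Schmidt) of a full-column-rank matrix given as rows."""
    n, m = len(A), len(A[0])
    Qcols, R = [], [[0.0] * m for _ in range(m)]
    for j in range(m):
        v = [float(A[i][j]) for i in range(n)]
        for i, q in enumerate(Qcols):
            R[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            v = [vk - R[i][j] * qk for vk, qk in zip(v, q)]
        R[j][j] = math.sqrt(sum(vk * vk for vk in v))
        Qcols.append([vk / R[j][j] for vk in v])
    return Qcols, R


def inv_upper(R):
    """Inverse of a nonsingular upper triangular matrix by back-substitution."""
    m = len(R)
    X = [[0.0] * m for _ in range(m)]
    for j in range(m):                       # solve R X[:, j] = e_j
        for i in range(m - 1, -1, -1):
            s = (1.0 if i == j else 0.0) - sum(R[i][k] * X[k][j] for k in range(i + 1, m))
            X[i][j] = s / R[i][i]
    return X


def columns(M):
    return [[row[i] for row in M] for i in range(len(M[0]))]


def gram(vecs):
    """Matrix of pairwise inner products, i.e., C^T C for C = [vecs]."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vecs] for u in vecs]


def mmse_dfe_filters(H):
    """QR of [H; I] as in (11); returns (columns of Q1, R1)."""
    n, m = len(H), len(H[0])
    aug = [list(row) for row in H] + [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    Qcols, R1 = qr_mgs(aug)
    return [q[:n] for q in Qcols], R1        # Q1 = upper n x m part of Q~


def transformed_signal(Q1cols, R1, r, v):
    """y' = Q1^T r - R1 v."""
    return [sum(a * b for a, b in zip(q, r)) - sum(R1[i][j] * v[j] for j in range(len(v)))
            for i, q in enumerate(Q1cols)]
```

The under-determined example in the test ($n = 2$, $m = 3$) would make plain QR of $H$ fail, while the augmented factorization always succeeds.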



3.2 Inducing Sparsity: Right Preprocessing 

In order to obtain the tree structure, one needs to put $R_1 G$ in upper triangular form $R$ via QR
decomposition. The sparser the matrix $R$, the smaller the complexity of the tree search algorithm.
For example, a diagonal $R$ means that symbol-by-symbol detection is optimal, i.e., the tree search
reduces to exploring a single path in the tree. Loosely, if one adopts a depth-first search strategy,
then a sparse $R$ will lead to a better quality of the first leaf node found by the algorithm.³
Consequently, the algorithm finds the closest point in a shorter time [14].

While we have no rigorous method for relating the "sparsity" of $R$ to the complexity of the tree
search, inspired by decision feedback equalization in ISI channels, we define the sparsity index of the
upper triangular matrix $R$ as follows

$$S(R) = \max_{i \in \{1, \ldots, m\}} \frac{\sum_{j < i} r_{j,i}^2}{r_{i,i}^2} \tag{13}$$

where $r_{i,j}$ denotes the $(i,j)$-th element of $R$. One can argue that the smaller $S(R)$, the sparser $R$
(e.g., $S(R) = 0$ for $R$ diagonal). The goal of right preprocessing is to find a change of basis of the
lattice $\{R_1 G x : x \in \mathbb{Z}^m\}$, such that the new lattice generator matrix, $S$, satisfies $S = QR$ with

² This argument can be made rigorous by considering certain classes of lattices of increasing dimension, Voronoi
shaping, and random uniformly distributed dithering common to both the transmitter and the receiver, as shown in
[TBI EH.

³ More details on the different search strategies are reported in Section 4.
$S(R)$ as small as possible. This amounts to finding a unimodular matrix $T$ (i.e., the entries of $T$ and
$T^{-1}$ are integers) such that $R_1 G = QRT$ with $Q$ unitary and $S(R)$ minimized over the group of
unimodular matrices. This optimization problem appears very difficult to solve; however, there exist
many heuristic approaches to find unimodular matrices that give small values of $S(R)$. Examples
of such methods, considered here, are lattice reduction, column permutation, and a combination
thereof.

Lattice reduction finds a reduced lattice basis, i.e., the columns of the reduced generator matrix
$S$ have "minimal" norms and are as orthogonal as possible.⁴ The most widely used reduction
algorithm is due to Lenstra, Lenstra and Lovász (LLL) [35] and has polynomial complexity
in the lattice dimension. An enhanced version of the LLL algorithm, namely the deep insertion
modification, was later proposed by Schnorr and Euchner. LLL with deep insertion gives a
reduced basis with significantly shorter vectors. In practice, the complexity of LLL with
deep insertion is similar to that of the original algorithm, even though it is an exponential-time algorithm in the
worst case.
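For concreteness, the following Python sketch implements the textbook LLL algorithm with parameter $\delta = 3/4$. It is not the deep insertion variant of Schnorr and Euchner, it stores basis vectors as rows (i.e., the transposed columns of a generator matrix), and it recomputes the Gram-Schmidt data at every step for clarity rather than speed; the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(b):
    """Gram-Schmidt vectors b*_i and coefficients mu[i][j] of the rows b_i."""
    n = len(b)
    bstar, mu = [], [[0.0] * n for _ in range(n)]
    for i in range(n):
        v = [float(a) for a in b[i]]
        for j in range(i):
            mu[i][j] = dot(b[i], bstar[j]) / dot(bstar[j], bstar[j])
            v = [vi - mu[i][j] * bj for vi, bj in zip(v, bstar[j])]
        bstar.append(v)
    return bstar, mu


def lll_reduce(basis, delta=0.75):
    """Textbook LLL reduction of an integer basis given as rows (no deep insertion)."""
    b = [list(v) for v in basis]
    n, k = len(b), 1
    while k < n:
        for j in range(k - 1, -1, -1):           # size reduction of b_k
            _, mu = gram_schmidt(b)
            q = round(mu[k][j])
            if q:
                b[k] = [x - q * y for x, y in zip(b[k], b[j])]
        bstar, mu = gram_schmidt(b)
        lhs = dot(bstar[k], bstar[k])
        rhs = (delta - mu[k][k - 1] ** 2) * dot(bstar[k - 1], bstar[k - 1])
        if lhs >= rhs:                            # Lovasz condition holds
            k += 1
        else:                                     # swap and step back
            b[k], b[k - 1] = b[k - 1], b[k]
            k = max(k - 1, 1)
    return b


def volume_sq(b):
    """Squared lattice volume: product of |b*_i|^2 (invariant under unimodular changes)."""
    p = 1.0
    for v in gram_schmidt(b)[0]:
        p *= dot(v, v)
    return p
```

On the basis $(1,1,1), (-1,0,2), (3,5,6)$ the sketch returns $(0,1,0), (1,0,1), (-1,0,2)$: much shorter vectors spanning the same lattice, as `volume_sq` confirms.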

Another method for decreasing $S(R)$ consists of ordering the columns of $R_1 G$, i.e., right-multiplying
it by a permutation matrix $\Pi$. In the sequel, we shall use the V-BLAST greedy ordering
strategy proposed in |27llo r |. This algorithm finds a permutation matrix $\Pi$ such that $R_1 G = QR\Pi$
maximizes $\min_i r_{i,i}^2$. Since $R^T R = \Pi^{-T} G^T R_1^T R_1 G \Pi^{-1}$, the set of column norms $\{\sum_j r_{j,i}^2 : i = 1, \ldots, m\}$
depends only on $R_1 G$ and not on $\Pi$; hence, by maximizing the minimum $r_{i,i}^2$, this algorithm minimizes
$S(R)$ over the group of permutation matrices (a subgroup of the unimodular matrices).

Lattice reduction and column permutation can be combined. This yields a unimodular matrix
$T = \Pi T_1$, where $T_1$ is obtained by lattice-reducing $R_1 G$ and $\Pi$ by applying the V-BLAST greedy
algorithm to the resulting reduced matrix $R_1 G T_1^{-1}$.
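The effect of column ordering on the diagonal of $R$ can be illustrated on a small example: the product of the $r_{i,i}^2$ (the squared lattice volume) is the same for every order, while their minimum is not. The Python sketch below replaces the greedy V-BLAST rule by an exhaustive search over all column orders, which is only feasible for small $m$; the helper names and the matrix are hypothetical.

```python
import math
from itertools import permutations


def qr_diag_sq(A):
    """Squared diagonal entries r_ii^2 of the QR factor of A (rows, full column rank)."""
    n, m = len(A), len(A[0])
    basis, diag = [], []
    for j in range(m):
        v = [A[i][j] for i in range(n)]
        for q in basis:                      # remove components along earlier columns
            c = sum(qk * vk for qk, vk in zip(q, v))
            v = [vk - c * qk for vk, qk in zip(v, q)]
        norm_sq = sum(vk * vk for vk in v)
        diag.append(norm_sq)
        basis.append([vk / math.sqrt(norm_sq) for vk in v])
    return diag


def permute_columns(A, p):
    """Columns of A reordered as p[0], p[1], ..."""
    return [[row[j] for j in p] for row in A]


def best_column_order(A):
    """Exhaustive stand-in for greedy V-BLAST ordering: maximize min_i r_ii^2."""
    m = len(A[0])
    return max(permutations(range(m)),
               key=lambda p: min(qr_diag_sq(permute_columns(A, p))))
```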

As observed before, the unimodular right multiplication does not change the lattice but may
significantly complicate the boundary control. In fact, we have

$$\min_{x \in \mathcal{U}} \| y' - R_1 G x \|^2 = \min_{x \in \mathcal{U}} \| y' - QRTx \|^2 = \min_{x \in T\mathcal{U}} \| Q^T y' - Rx \|^2. \tag{14}$$

The new constraint set $T\mathcal{U}$ might be even more complicated to enforce than the original information
set $\mathcal{U}$ (see Fig. 2(d)). However, although modifying the boundary control may result
in a significant complexity increase for ML decoding, lattice decoding is not affected at all, since
$T\mathbb{Z}^m = \mathbb{Z}^m$.
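The effect of a unimodular change of basis on the boundary can be seen on a two-dimensional toy case. In the Python sketch below, the matrix `T` is a hypothetical example with determinant one: it maps the hypercube $\mathbb{Z}_2^2$ onto a set that is no longer a box, yet `T` and its inverse are both integral, so every integer point remains reachable and lattice decoding is unaffected.

```python
from itertools import product


def matvec(T, x):
    """Integer matrix-vector product, returned as a (hashable) tuple."""
    return tuple(sum(a * b for a, b in zip(row, x)) for row in T)


def image(T, U):
    """The set T U = {T x : x in U}."""
    return {matvec(T, x) for x in U}


def is_integer_box(S):
    """True if S is the full integer box spanned by its coordinate-wise min and max."""
    coords = list(zip(*S))
    box = set(product(*(range(min(c), max(c) + 1) for c in coords)))
    return box == set(S)


T = [[1, 1], [0, 1]]                  # hypothetical unimodular matrix (det = 1)
T_inv = [[1, -1], [0, 1]]             # its inverse, also integral
U = set(product(range(2), repeat=2))  # hypercube information set Z_2^2
TU = image(T, U)                      # {(0,0), (1,0), (1,1), (2,1)}: not a box
```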



3.3 Forming the Tree 

The final step in preprocessing is to expose the tree structure of the problem. In this step, QR
decomposition is applied to the transformed combined channel and lattice matrix $R_1 G T^{-1}$ obtained after
left and right preprocessing. The upper triangular nature of $R$ means that a tree search can now
be used to solve the CLPS problem. Fig. 1 illustrates an example of such a tree.

Here, we wish to stress that our approach for exposing the tree is fundamentally different from 
the one traditionally used for codes over finite alphabets (e.g., linear block codes, convolutional 

⁴ For more details on the different notions and methods of lattice reduction, the reader is referred to HJ.
codes, trellis coset codes in AWGN channels). Here, we operate over the field of real numbers
and consider the lattice corresponding to the joint effect of encoding and channel distortion. In
the conventional approach, the tree is generated from the trellis structure of the code alone and,
hence, does not allow for a natural tree search that jointly handles detection (the linear channel)
and decoding. In fact, joint detection and decoding is achieved either at the expense of an increase of
the overall system memory (joint trellis), or by neglecting some paths in the search (e.g., by
per-survivor reduced-state processing). Since operating on the full joint trellis is usually too complex,
both the proposed and the conventional per-survivor (reduced-state) approach are suboptimal, and
the question is which one achieves the better performance/complexity tradeoff.

For the sake of convenience, in the following we shall denote again by $y$ the channel output after
all transformations, i.e., the tree search is applied to the CLPS problem $\min_{x \in \mathbb{Z}^m} \|y - Rx\|^2$ with $R$
in upper triangular form. The components of vectors and matrices are numbered in reverse order,
so that the preprocessed received signal can finally be written as

$$\begin{bmatrix} y_m \\ y_{m-1} \\ \vdots \\ y_1 \end{bmatrix} = \begin{bmatrix} r_{m,m} & r_{m,m-1} & \cdots & r_{m,1} \\ 0 & r_{m-1,m-1} & \cdots & r_{m-1,1} \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & r_{1,1} \end{bmatrix} \begin{bmatrix} x_m \\ x_{m-1} \\ \vdots \\ x_1 \end{bmatrix} + \begin{bmatrix} w_m \\ w_{m-1} \\ \vdots \\ w_1 \end{bmatrix}. \tag{15}$$

Notice that after preprocessing the problem is always square, of dimension $m$, even though the
original problem has arbitrary $m$ and $n$. Throughout the paper, we consider a tree rooted at a fixed
dummy node $x_1^0$. The node at level $k$ is denoted by the label $x_1^k = (x_1, x_2, \ldots, x_k)$. Moreover, every
node $x_1^k$ is associated with the squared distance

$$w_k(x_1^k) = \Big| y_k - \sum_{j=1}^{k} r_{k,j} x_j \Big|^2. \tag{16}$$

The difference between the transmitted codeword $x$ and any valid codeword $\hat{x}$ is denoted by $\tilde{x}$,
i.e., $\tilde{x} = x - \hat{x}$.
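The decomposition of the distance of a leaf into the branch metrics (16) takes only a few lines. The Python sketch below stores $R$ in natural index order, so that row $k$ holds $r_{k,1}, \ldots, r_{k,k}$ followed by zeros (the reversed numbering of (15) displays the same matrix as upper triangular); the names and numbers are hypothetical.

```python
def branch_metric(R, y, path):
    """w_k(x_1^k) of (16) for k = len(path); R[k-1][j] stores r_{k,j+1}."""
    k = len(path)
    return (y[k - 1] - sum(R[k - 1][j] * path[j] for j in range(k))) ** 2


def path_metrics(R, y, x):
    """Branch metrics w_1, ..., w_m along the path from the root to leaf x."""
    return [branch_metric(R, y, x[:k]) for k in range(1, len(x) + 1)]


def squared_distance(R, y, x):
    """|y - Rx|^2 computed directly, for comparison with the sum of branch metrics."""
    return sum((yk - sum(rkj * xj for rkj, xj in zip(row, x))) ** 2
               for yk, row in zip(y, R))


R = [[2.0, 0.0, 0.0], [0.5, 1.5, 0.0], [-0.3, 0.7, 1.0]]
y = [1.1, 2.9, 0.4]
w = path_metrics(R, y, (1, 1, 0))   # approximately [0.81, 0.81, 0.0]
```

Because each $w_k$ depends only on $x_1^k$ and is non-negative, the partial sums grow monotonically along every path, which is what allows a tree search to discard a whole subtree as soon as a partial sum exceeds a bound.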

We hasten to stress that the preprocessing steps highlighted in Sections 3.1-3.3 are for a general
setting. In some special cases, some steps can be eliminated or alternative options can be used. 
Some of these cases are listed hereafter. 

1. Upper Triangular Code Generator Matrix: 

In this case, after taming the channel ($H \to R_1$), the new combined matrix $R_1 G$ is also upper
triangular and can be directly used to form the tree without any further preprocessing (if one
decides against right preprocessing).

2. Uncoded V-BLAST: 

For uncoded V-BLAST systems (i.e., $G = I$), applying the MMSE-DFE greedy ordering of
|32[ Ej may achieve a lower tree search complexity than applying MMSE-DFE left
preprocessing, lattice reduction, and greedy ordering of the final QR decomposition. This is
especially true for large dimensions, where lattice reduction is less effective.



3. The Hermite Normal Form Transformation:

Ultimately, any hardware implementation of the decoder requires finite-precision arithmetic. In this
case, all quantities are scaled and quantized such that they take on integer values. While all
the preprocessing steps in Sections 3.1-3.3 can be easily adapted to finite-precision arithmetic, there
exist other efficient transformations for integral matrices that may yield a smaller complexity
than the ones mentioned above. For example, one can apply the Hermite normal form (HNF)
jHj directly to the scaled (and quantized) version of $HG$, factoring it as $RT$, with $T$ unimodular
and $R$ upper triangular with the property that each diagonal element dominates the rest of
the entries in the same row (i.e., $r_{i,i} > r_{i,j} \geq 0$, $i = 1, \ldots, m$, $j = i + 1, \ldots, m$). Interestingly,
the HNF transformation improves the sparsity index and reduces the preprocessing to a single
step.

4 The Tree Search Stage 

After proper preprocessing, the second stage of the CLPS corresponds to an instance of searching
for the best path in a tree. In this setting, the tree has a maximum depth $m$, and the goal is to
find the node(s) at level $m$ with the least squared distance, where the squared distance of any
node at level $m$ (called a leaf node) is given by

$$d^2(x_1^m, y) = \sum_{i=1}^{m} w_i(x_1^i). \tag{17}$$

Visiting all leaf nodes to find the one with the least metric is either prohibitively complex
(exponential in $m$) or not possible, as with lattice decoding, where the set of leaf nodes is infinite. The complexity of tree search can be
reduced by the branch and bound (BB) algorithm, which determines whether an intermediate node $x_1^k$, on
extending, has any chance of yielding the desired leaf node. This decision is taken by comparing
the cost function assigned to the node by the search algorithm, against a bounding function. In 
the following section, we propose a generic tree search stage, inspired by the BB algorithm, that 
encompasses many known algorithms for CLPS as its special cases. We further use this algorithm 
to classify various tree search algorithms and elucidate some of their structural properties. 

4.1 Generic Branch and Bound Search Algorithm 

Before describing the proposed algorithm, we first need to introduce some more notation. 

• ACTIVE is an ordered list of nodes.

• $f(x_1^k) \in \mathbb{R}$ is the cost function of any node $x_1^k$ in the tree, and $t \in \mathbb{R}^{m\times 1}$ is the bounding
function.

• Any node $x_1^k$ in the search space of the search algorithm is a valid node if $f(x_1^k) \leq t_k$.

• A node is generated by the search algorithm if the node occupies any position in ACTIVE at
some instant during the search.

• "sort" is a rule for ordering the nodes in the list ACTIVE.

• "gen" is a rule defining the order for generating the child nodes of the node being extended.

• $g_1$ and $g_2$ are rules for tightening the bounding function.

• At any instant, the leaf node with the least distance generated so far by the search algorithm
is stored in $\hat{x}$.

• We define the search complexity of a tree search algorithm as the number of nodes generated
by the algorithm.

• Two search algorithms are said to be equivalent if they generate the same set of nodes.

• A BB algorithm whose solution is guaranteed to be (one of) the leaf node(s) with least distance
is called an optimal BB algorithm. If the solution is not guaranteed to have the least distance
to $y$ among all leaf nodes, then the BB algorithm is a heuristic BB algorithm.

We are now ready to present our generic search algorithm. 

GBB($f$, $t$, sort, gen, $g_1$, $g_2$):

1. Create the empty list ACTIVE, and place the root node in ACTIVE. Set $n_c \leftarrow 1$.

2. Let $x_1^k$ be the top node of ACTIVE.
If $x_1^k$ is a leaf node ($k = m$), then
$t \leftarrow g_1(t, f(x_1^m))$ and $\hat{x} \leftarrow \arg\min\big(\sum_{i=1}^{m} w_i(x_1^i), \sum_{i=1}^{m} w_i(\hat{x}_1^i)\big)$.
Remove $x_1^k$ from ACTIVE.
Go to step 4.

If $x_1^k$ is not a valid node, then remove $x_1^k$ from ACTIVE. Go to step 4.

If all valid child nodes of $x_1^k$ have already been generated, then remove $x_1^k$ from ACTIVE. Go
to step 4.

Generate a valid child node $x_1^{k+1}$ of $x_1^k$, not generated before, according to the order gen, and
place it in ACTIVE. Set $n_c \leftarrow n_c + 1$. Set $t \leftarrow g_2(t, n_c, \text{ACTIVE})$. Update $f(x_1^k)$, $f(x_1^{k+1})$.

3. Sort the nodes in ACTIVE according to sort.

4. If ACTIVE is empty, then exit. Else, go to step 2.

In GBB, $g_1$ allows one to tighten the bounding function when a leaf node reaches the top of ACTIVE, whereas $g_2$ allows for restricting the search space in heuristic BB algorithms. For example, setting

$$g_2(\mathbf{t}, n_c, \text{ACTIVE}) = [-\infty, -\infty, \ldots, -\infty]^T \quad \text{whenever } n_c > n_{c,t}$$

will force the search algorithm to terminate when the number of nodes generated increases beyond a tolerable limit on the complexity given by $n_{c,t}$: with every component of the bound at $-\infty$, no node in ACTIVE remains valid, so all nodes are removed and the list empties. Whenever a leaf node reaches the top of ACTIVE, $\hat{\mathbf{x}}$ is updated if appropriate. Now, we use GBB to classify various tree search algorithms into three broad categories. This classification highlights the structural properties and advantages/disadvantages of the different search algorithms.
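The GBB template can be made concrete with a small Python sketch. It is a toy under stated assumptions, not the paper's implementation: we assume a lower-triangular matrix `L`, so that the branch metric is $w_{k+1}(\mathbf{x}_1^{k+1}) = (y_{k+1} - \sum_{j \le k+1} L_{k+1,j} x_j)^2$, a finite symbol alphabet, and a scalar bound (all components of $\mathbf{t}$ equal). Unlike GBB, the sketch generates all valid children of a node at once. The `order` argument plays the role of sort (FIFO for BrFS, LIFO for DFS, a min-heap for BeFS), and `max_nodes` mimics the $g_2$ budget $n_{c,t}$. The names `gbb`, `branch_metric`, and `max_nodes` are hypothetical.

```python
import heapq
import itertools
import math
from collections import deque


def branch_metric(L, y, prefix, x_next):
    """Hypothetical w_{k+1}: squared residual of row k+1 of a lower-triangular L."""
    k = len(prefix)
    s = y[k] - sum(L[k][j] * prefix[j] for j in range(k))
    return (s - L[k][k] * x_next) ** 2


def gbb(L, y, alphabet, radius, order="dfs", max_nodes=None):
    """Toy GBB: `order` selects BrFS ("bfs"), DFS ("dfs") or BeFS ("best")."""
    m = len(y)
    bound = radius                       # all components of t equal to `radius`
    best, best_cost = None, math.inf
    tie = itertools.count()              # tie-breaker so tuples never compare nodes
    active = [] if order == "best" else deque()

    def push(cost, node):
        if order == "best":
            heapq.heappush(active, (cost, next(tie), node))
        else:
            active.append((cost, next(tie), node))

    def pop():
        if order == "best":
            return heapq.heappop(active)
        return active.popleft() if order == "bfs" else active.pop()

    push(0.0, ())
    n_c = 1
    while active:
        cost, _, node = pop()
        if cost > bound:                 # invalidated by an earlier tightening
            continue
        if len(node) == m:               # a leaf reached the top of ACTIVE
            if cost < best_cost:
                best, best_cost = node, cost
            if order != "bfs":
                bound = min(bound, cost)   # g_1 tightening (DFS and BeFS only)
            if order == "best":
                break                    # cost-sorted ACTIVE: this leaf is optimal
            continue
        for x in alphabet:               # simplification: expand all valid children
            c = cost + branch_metric(L, y, node, x)
            if c <= bound:
                push(c, node + (x,))
                n_c += 1
        if max_nodes is not None and n_c > max_nodes:
            break                        # g_2-style budget n_{c,t} exceeded
    return best, best_cost, n_c
```

With a large enough `radius`, all three orders return a leaf of the same minimum cost; they differ only in how many nodes they generate along the way.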
4.1.1 Breadth First Search

GBB becomes a Breadth First Search (BrFS) if $g_1(\mathbf{t}, f(\mathbf{x}_1^m)) = \mathbf{t}$, and the cost function $f$ of any node, once determined, is never updated. Ultimately, all nodes $\mathbf{x}_1^k$ whose cost function along the path does not rise above the bounding function are generated before the algorithm terminates, unless the $g_2$ function removes their parent nodes from ACTIVE. Now, we can establish the equivalence between various sphere/sequential decoders and BrFS.

The first algorithm is the Pohst enumeration strategy reported in [23]. In this strategy, the bounding function $\mathbf{t}$ consists of equal components $C_0$, where $C_0$ is a constant chosen before the start of search$^{5}$, and the cost function of a node $\mathbf{x}_1^k$ is $f(\mathbf{x}_1^k) = \sum_{i=1}^{k} w_i(\mathbf{x}_1^i)$. Therefore, all nodes $\mathbf{x}_1^k$ in the search space that satisfy

$$\sum_{i=1}^{k} w_i(\mathbf{x}_1^i) \leq C_0 \qquad (18)$$

are generated before termination. Generating the child nodes in this strategy is simplified by the following observation. For any parent node $\mathbf{x}_1^k$, the condition $\sum_{i=1}^{k+1} w_i(\mathbf{x}_1^i) \leq C_0$ for the set of generated child nodes implies that the $(k+1)$-th component of the generated child nodes lies in some interval $[a_0, a_1]$.

The second example is the statistical pruning (SP) decoder, which is equivalent to a heuristic BrFS decoder. Two variations of SP are proposed in [30], the increasing radii (IR) and elliptical pruning (EP) algorithms. The IR algorithm is a BrFS with the bounding function $\mathbf{t} = \{t_1, \ldots, t_m\}$, where $t_k$, $1 \leq k \leq m$, are constants chosen before the start of search. The cost function for any node in IR is the same as in Pohst enumeration. The EP algorithm is given by the bounding function $\mathbf{t} = \{1, \ldots, 1\}$, and the cost function for the node $\mathbf{x}_1^k$ given by $f(\mathbf{x}_1^k) = \sum_{i=1}^{k} w_i(\mathbf{x}_1^i) / C_k$, where $C_k$, $1 \leq k \leq m$, are constants. More generally, when $g_2(\cdot) = \mathbf{t}$, i.e., $g_2$ is not used, the resulting BrFS algorithm is equivalent to the Wozencraft sequential decoder [37], where, depending on the cost function, the decoder can be heuristic or optimal.
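The interval $[a_0, a_1]$ follows from (18) once a concrete branch metric is fixed. The sketch below assumes the same hypothetical lower-triangular model as in the GBB sketch, $w_{k+1} = (s - L_{k+1,k+1} x_{k+1})^2$, with $s$ the residual after subtracting the already-decided components, and lattice decoding over $\mathbb{Z}$. Under these assumptions the condition becomes $|x_{k+1} - s/L_{k+1,k+1}| \le \sqrt{C_0 - \sum_{i \le k} w_i}\,/\,|L_{k+1,k+1}|$. `pohst_bfs` expands the tree level by level, which is the BrFS behaviour described above. Both function names are illustrative.

```python
import math


def pohst_interval(L, y, prefix, partial_cost, C0):
    """Integer interval [a0, a1] of (k+1)-th components keeping the cost <= C0.

    Hypothetical model: w_{k+1} = (s - L[k][k] * x)^2 with
    s = y[k] - sum_{j<k} L[k][j] * prefix[j] and L lower triangular.
    Returns None when no integer child satisfies the constraint.
    """
    k = len(prefix)
    slack = C0 - partial_cost             # budget left for w_{k+1}
    if slack < 0:
        return None
    s = y[k] - sum(L[k][j] * prefix[j] for j in range(k))
    center = s / L[k][k]
    half_width = math.sqrt(slack) / abs(L[k][k])
    a0, a1 = math.ceil(center - half_width), math.floor(center + half_width)
    return (a0, a1) if a0 <= a1 else None


def pohst_bfs(L, y, C0):
    """Breadth-first Pohst enumeration: every integer x with sum_i w_i <= C0."""
    level = [((), 0.0)]                   # (node, cost) pairs at the current depth
    for k in range(len(y)):
        nxt = []
        for node, cost in level:
            interval = pohst_interval(L, y, node, cost, C0)
            if interval is None:
                continue
            s = y[k] - sum(L[k][j] * node[j] for j in range(k))
            for x in range(interval[0], interval[1] + 1):
                nxt.append((node + (x,), cost + (s - L[k][k] * x) ** 2))
        level = nxt
    return level
```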

The M-algorithm [2] and the T-algorithm [?] are also examples of heuristic BrFS. Here, however, $g_2$ serves an important role in restricting the search space. In both algorithms, sort is defined as follows. Any node in ACTIVE at level $k$ is placed above any node at level $k+1$, and nodes in the same level are sorted in ascending order of their cost functions. In the M-algorithm, after the first node at level $k+1$ is generated (indicating that all valid nodes at level $k$ have already been generated), $g_2$ sets $t_k$ to the cost function of the M-th node at level $k$ (where M is an initial parameter of the M-algorithm). In the T-algorithm, $g_2$ sets $t_k$ to $(f(\mathbf{x}_1^k) + T)$, where $\mathbf{x}_1^k$ is the top node at level $k$ in ACTIVE, and $T$ is a parameter of the T-algorithm. After $t_k$ is tightened in this manner, all nodes in ACTIVE at level $k$ that satisfy $f(\mathbf{x}_1^k) > t_k$ are rendered invalid, and are subsequently removed from ACTIVE.
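A minimal sketch of the M-algorithm's level-by-level pruning follows, under the same hypothetical lower-triangular model with a finite alphabet. Keeping exactly the M best nodes with `heapq.nsmallest` is a simplification: the rule above keeps every node whose cost does not exceed that of the M-th node, so ties could retain a few more. The function name `m_algorithm` is illustrative.

```python
import heapq


def m_algorithm(L, y, alphabet, M):
    """M-algorithm sketch: keep only the M lowest-cost nodes at each level."""
    survivors = [((), 0.0)]               # (node, cost) pairs
    for k in range(len(y)):
        children = []
        for node, cost in survivors:
            s = y[k] - sum(L[k][j] * node[j] for j in range(k))
            for x in alphabet:
                children.append((node + (x,), cost + (s - L[k][k] * x) ** 2))
        # g_2 step: t_k drops to the M-th smallest cost, so at most M nodes survive
        survivors = heapq.nsmallest(M, children, key=lambda nc: nc[1])
    return min(survivors, key=lambda nc: nc[1])
```

The number of nodes processed per level is fixed at $M \cdot |\text{alphabet}|$, which is the channel-independent complexity mentioned below; with $M = 1$ the sketch reduces to successive hard decisions.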

In general, BrFS algorithms are naturally suited for applications that require soft outputs, as opposed to a hard decision on the transmitted frame. The reason is that such algorithms output an ordered$^{6}$ list of candidate codewords. One can then compute the soft outputs from this list

5 For the sake of simplicity, we assumed in the above classification that the bounding function is chosen such that 
at least one leaf node is found before the search terminates. If, however, no leaf node is found before the search 
terminates, the bounding function is relaxed and the search is started afresh. 

6 The list is ordered based on the cost functions of the different candidates.






using standard techniques (e.g., [?], [?]). Here, we note that in the proposed joint detection and
decoding framework, soft outputs are generally not needed. Another advantage of BrFS is that the 
complexity of certain decoders inspired by this strategy is robust against variations in the SNR and 
channel conditions. For example, the M-algorithm has a constant complexity independent of the 
channel conditions. This property is appealing for some applications, especially those with hard 
limits on the maximum, rather than average, complexity. On the other hand, decoders inspired by 
the BrFS strategy usually offer poor results in terms of the average complexity, especially at high 
SNR. One would expect a reduced average complexity if the bounding function is varied during the 
search to exploit the additional information gained as we go on. This observation motivates the 
following category of tree search algorithms. 

4.1.2 Depth First Search 

GBB becomes a depth first search (DFS) when the following conditions are satisfied. The sorting 
rule sort orders the nodes in ACTIVE in reverse order of generation, i.e., the last generated node 
occupies the top of ACTIVE, and 

$$g_1(\mathbf{t}, f(\mathbf{x}_1^m)) = [\min(t_1, f(\mathbf{x}_1^m)), \ldots, \min(t_m, f(\mathbf{x}_1^m))]^T.$$

As in BrFS, the cost function of any node, once generated, remains constant. Even among algorithms within the class of DFS algorithms, other parameters, like gen and $g_2$, can significantly alter the search behavior. To illustrate this point, we contrast in the following several sphere decoders which are equivalent to DFS strategies.

The first example of such decoders is the modified Viterbo-Boutros (VB) decoder reported in [?]. In this decoder, $g_2(\mathbf{t}, n_c) = \mathbf{t}$, and the cost function for any node $\mathbf{x}_1^k$ is $f(\mathbf{x}_1^k) = \sum_{i=1}^{k} w_i(\mathbf{x}_1^i)$. For any node $\mathbf{x}_1^k$ and its corresponding interval $[a_0, a_1]$ for valid child nodes, the function gen generates the child node with $a_0$ as its $(k+1)$-th component first. Our second example is the Schnorr-Euchner (SE) search strategy first reported in [?]. This decoder shares the same cost functions and $g_2$ with the modified VB decoder, but differs from it in the order of generating the child nodes. For any node $\mathbf{x}_1^k$ and its corresponding interval $[a_0, a_1]$ for the valid child nodes, let $a_m = \lfloor (a_0 + a_1)/2 \rceil$, where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer, and $\delta = \operatorname{sign}(w_{k+1}(\mathbf{x}_1^{k+1}))$. Then, the function gen in the SE decoder generates nodes according to the order $\{a_m, a_m + \delta, a_m - \delta, a_m + 2\delta, \ldots\}$.
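The SE generation order can be sketched as follows. Here we assume that the unconstrained estimate `center` of $x_{k+1}$ (the midpoint of $[a_0, a_1]$) is available, and we take $\delta$ to point from $a_m$ toward that estimate, which makes the candidates come out in ascending distance from it. This reading of $\delta$ and the name `se_order` are our assumptions.

```python
def se_order(center, a0, a1):
    """Schnorr-Euchner zig-zag over the integers of [a0, a1].

    a_m is the integer nearest to `center` (clamped into the interval) and delta
    points from a_m toward `center`, so candidates come out in ascending |center - x|.
    """
    if a0 > a1:
        return []                          # no valid child
    a_m = min(max(round(center), a0), a1)
    delta = 1 if center >= a_m else -1
    order = [a_m]
    step = 1
    while True:
        added = False
        for x in (a_m + step * delta, a_m - step * delta):
            if a0 <= x <= a1:
                order.append(x)
                added = True
        if not added:                      # both sides have left the interval for good
            return order
        step += 1
```

For example, `se_order(2.3, 0, 5)` yields `[2, 3, 1, 4, 0, 5]`, whereas the modified VB order would simply be `[0, 1, 2, 3, 4, 5]`.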

Due to the adaptive tightening of the bounding function, DFS algorithms have a lower average complexity than the corresponding BrFS algorithms with the same cost functions, especially at high SNR. Another advantage of the DFS approach is that it allows for greater flexibility in the performance-complexity tradeoff through a carefully constructed termination strategy. For example, if we terminate the search after finding the first leaf node, i.e., $n_c = m$, then we have the MMSE-Babai point decoder. This decoder corresponds to the MMSE-DFE solution aided with the right preprocessing stage. It was shown in [?] that the performance of this decoder is within a fraction of a dB from the ML decoder in systems with small dimensions. The fundamental weakness of DFS algorithms is that the sorting rule is static and does not exploit the information gained thus far to speed up the search process.
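Combining the SE order with the adaptive tightening of $g_1$ gives a depth-first sphere decoder. The sketch assumes the hypothetical lower-triangular model from the earlier examples and lattice decoding over $\mathbb{Z}$ with an infinite initial radius. Setting `first_leaf_only=True` stops at the first leaf, i.e., the Babai point (here without the MMSE-DFE preprocessing, which is outside the scope of the sketch). The names `se_children` and `se_sphere_decode` are illustrative.

```python
import math


def se_children(L, y, prefix):
    """Yield (x_{k+1}, w_{k+1}) in Schnorr-Euchner order, i.e. ascending w_{k+1}."""
    k = len(prefix)
    s = y[k] - sum(L[k][j] * prefix[j] for j in range(k))
    c = s / L[k][k]                       # unconstrained estimate of x_{k+1}
    a = round(c)
    d = 1 if c >= a else -1
    yield a, (s - L[k][k] * a) ** 2
    step = 1
    while True:                           # infinite: lattice decoding over Z
        for x in (a + step * d, a - step * d):
            yield x, (s - L[k][k] * x) ** 2
        step += 1


def se_sphere_decode(L, y, first_leaf_only=False):
    """Depth-first search with SE ordering and radius tightening (g_1)."""
    m = len(y)
    state = {"x": None, "cost": math.inf, "nodes": 1}   # the root counts as generated

    def dfs(prefix, cost):
        for x, w in se_children(L, y, prefix):
            c = cost + w
            if c >= state["cost"]:
                return                    # later siblings are no better: prune
            state["nodes"] += 1
            child = prefix + (x,)
            if len(child) == m:
                state["x"], state["cost"] = child, c   # tighten the bound
            else:
                dfs(child, c)
            if first_leaf_only and state["x"] is not None:
                return                    # Babai point: stop at the first leaf

    dfs((), 0.0)
    return state["x"], state["cost"], state["nodes"]
```

Because the children arrive in ascending cost, the first child that violates the current bound lets the whole remaining sibling list be skipped; this is where SE gains over the modified VB order.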

4.1.3 Best First Search 

GBB becomes a best first search (BeFS) when the following conditions are satisfied. The nodes in ACTIVE are sorted in ascending order of their cost functions, and

$$g_1(\mathbf{t}, f(\mathbf{x}_1^m)) = [\min(t_1, f(\mathbf{x}_1^m)), \ldots, \min(t_m, f(\mathbf{x}_1^m))]^T.$$

Note that in BeFS, the search can be terminated once a leaf node reaches the top of the list, since this means that all intermediate nodes have cost functions higher than that of this leaf node. Thus, the bounding function is tightened just once in this case. The stack algorithm is an example of a BeFS decoder, obtained by setting $g_2(\mathbf{t}, n_c) = \mathbf{t}$ and defining the cost function of any node in ACTIVE at any instant as follows: if $\mathbf{x}_1^k$ is a leaf node, then $f(\mathbf{x}_1^k) = -\infty$. Otherwise, we let $\mathbf{x}_1^{k+1}$ be the best child node of $\mathbf{x}_1^k$ not generated yet, and define $f(\mathbf{x}_1^k) = \sum_{i=1}^{k+1} w_i(\mathbf{x}_1^i) - b(k+1)$, where we refer to $b \in \mathbb{R}^+$ as the bias. Because of the efficiency of the sorting rule, BeFS algorithms are generally more efficient than the corresponding BrFS and DFS algorithms. This fact is formalized in the following theorems. Theorem 1 establishes the efficiency of the stack decoder with $b = 0$ among all known sphere decoders.

Theorem 1 The stack algorithm with $b = 0$ generates the least number of nodes among all optimal tree search algorithms.

The following result compares the heuristic stack algorithm, i.e., $b > 0$, with a special case of the IR algorithm [30], where the bounding function takes the form $t_k = bk + \delta$.

Theorem 2 The IR algorithm with bounding function $\{\mathbf{t} : t_k = bk + \delta\}$ generates at least as many nodes as those generated by the stack algorithm when the same bias $b$ is used.

Proof: Appendix C

At this point, it is worth noting that in our definition of search complexity, we count only the 
number of generated nodes, i.e., nodes that occupy some position in ACTIVE at some instant. In 
general, this is a reasonable abstraction of the actual computational complexity involved. However, 
in the stack algorithm, for each node generated, the cost functions of two nodes are updated instead of one: one for the generated node, and one for the parent node. Thus, the comparisons in Theorems 1 and 2 are not completely fair.

Finally, we report the following two advantages offered by the stack algorithm. First, it offers a natural solution for the problem of choosing the initial radius (or radii), which is commonly encountered in the design of sphere decoders (e.g., [?]). By setting all the components of $\mathbf{t}$ to $\infty$ (with $b = 0$), it is easy to see that we are guaranteed to find the closest lattice point while generating the minimum number of nodes (among all search algorithms that guarantee finding the closest point). Second, it allows for a systematic approach for trading off performance for complexity. To illustrate this point, if we set $b = 0$, we obtain the closest point lattice decoder (i.e., best performance but highest complexity). At the other extreme, when $b \to \infty$, the stack decoder reduces to the MMSE-Babai point decoder discussed in the DFS section (the number of nodes visited is always equal to $m$). In general, for systems with small $m$, one can obtain near-optimal performance with relatively large values of $b$. As the number of dimensions increases, more complexity must be expended (i.e., smaller values of $b$) to approach the optimal performance.
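The stack algorithm's cost rule, in which an internal node in ACTIVE carries the metric of its best not-yet-generated child and a leaf carries $-\infty$, maps naturally onto a binary heap. The sketch below assumes the same hypothetical lower-triangular model, lattice decoding over $\mathbb{Z}$, and SE ordering for the children. With `b=0` it should return the closest lattice point, and for very large `b` it follows the Babai path. The names `child_stream` and `stack_decode` are illustrative.

```python
import heapq
import itertools
import math


def child_stream(L, y, node, h, b):
    """Children of `node` with h = sum(w) - b * depth, in ascending order (SE zig-zag)."""
    k = len(node)
    s = y[k] - sum(L[k][j] * node[j] for j in range(k))
    c = s / L[k][k]
    a = round(c)
    d = 1 if c >= a else -1
    for step in itertools.count():
        for x in ((a,) if step == 0 else (a + step * d, a - step * d)):
            yield node + (x,), h + (s - L[k][k] * x) ** 2 - b


def stack_decode(L, y, b=0.0):
    """Stack algorithm: each entry's key is h of its best child not generated yet."""
    m = len(y)
    tie = itertools.count()
    active = []                           # heap of (f, tie, node, pending_child, stream)

    def insert(node, h):
        if len(node) == m:                # leaf: f = -inf sends it to the top
            heapq.heappush(active, (-math.inf, next(tie), node, None, None))
        else:
            stream = child_stream(L, y, node, h, b)
            child = next(stream)          # best child not generated yet
            heapq.heappush(active, (child[1], next(tie), node, child, stream))

    insert((), 0.0)
    generated = 1                         # the root
    while True:
        _, _, node, child, stream = heapq.heappop(active)
        if child is None:                 # a leaf is on top: it is the decision
            return node, generated
        insert(*child)                    # generate the pending child
        generated += 1
        nxt = next(stream)                # parent now waits on its next-best child
        heapq.heappush(active, (nxt[1], next(tie), node, nxt, stream))
```

Each pop of an internal node generates exactly one child and re-inserts the parent keyed on its next-best child, which is why two cost functions are updated per generated node, as noted before the discussion of Theorems 1 and 2.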
4.2 Iterative Best First Search

In Section 4.1, our focus was primarily devoted to complexity, defined as the number of nodes visited by the tree search algorithm. Another important aspect is the memory requirement entailed by the search. A straightforward implementation of the GBB algorithm requires maintaining the list ACTIVE, which can become prohibitively long in certain applications. This motivates the investigation of modified implementations of these search strategies that are more efficient in terms of storage requirements. The BrFS and DFS sphere decoders discussed in Sections 4.1.1 and 4.1.2 lend themselves naturally to storage-efficient implementations. Such implementations have been reported in [23], [?].

In order to exploit the complexity reduction offered by the BeFS strategy in practice, it is therefore important to seek modified memory-efficient implementations of such algorithms. This can be realized by storing only one node at a time, and allowing nodes to be visited more than once. The search in this case progresses in contours of increasing bounding functions, thus allowing more and more nodes to be generated at each step, finally terminating once a leaf node is obtained. The Fano decoder [22] is the iterative BeFS variation of the stack algorithm. Although the stack algorithm and the Fano decoder, with the same cost functions, generate essentially the same set of nodes [29], the Fano decoder visits some nodes more than once. However, unlike the stack algorithm, the Fano decoder requires essentially no memory. Appendix A provides an algorithmic description of the Fano decoder and a brief description of the relevant parameters. Overall, the proposed decoder consists of left preprocessing (MMSE-DFE) and right preprocessing (combined lattice reduction and greedy ordering), followed by the Fano (or stack) search stage for lattice, not ML, decoding.

5 Analytical and Numerical Results 

To illustrate the efficiency and generality of the proposed framework, we utilize it in three distinct 
scenarios. First, we consider uncoded transmission over MIMO channels (i.e., V-BLAST). Here, 
we present analytical, as well as simulation, results that demonstrate the excellent performance- 
complexity tradeoff achieved by the proposed Stack and Fano decoders. Then, we proceed to coded 
MIMO systems and apply tree search decoding to two different classes of space-time codes. Finally, 
we conclude with trellis coded transmission over ISI channels. 

5.1 The V-BLAST Configuration 

Unfortunately, analytical characterization of the performance-complexity tradeoff for sequential/sphere decoders with arbitrary HG and U still appears intractable. To avoid this problem, we restrict ourselves in this section to uncoded transmission over flat Rayleigh MIMO channels. In our analysis, we further assume that ZF-DFE preprocessing is used. The complexity reductions offered by the proposed preprocessing stage are demonstrated by numerical results.

Theorem 3 The Stack algorithm and the Fano decoder with any finite bias b, achieve the same 
diversity as the ML decoder when applied to a V-BLAST configuration. 

Proof: Appendix D

The result shows that the Fano decoder, unlike other heuristic algorithms like nulling-and-canceling, does not lead to a lower diversity than the ML decoder.
Theorem 4 In a V-BLAST system with $Q^2$-QAM, the average complexity per dimension of the stack algorithm for a sufficiently large bias $b$ is linear in $m$ when the SNR $\rho$ grows linearly with $m$ and $r = n - m > 0$.

Proof: Appendix E

Thus, one can achieve linear complexity with the stack algorithm by allowing the SNR to increase linearly with the lattice dimension. To validate our theoretical claims, we further report numerical results in selected scenarios. In our simulations, we assume that the channel matrix is square and choose the SE enumeration as the reference sphere decoder for comparison purposes. In all the figures, the subscript Z refers to ZF-DFE left preprocessing and the subscript M denotes MMSE-DFE left preprocessing followed by LLL reduction and V-BLAST greedy ordering for right preprocessing. In Fig. 3, the average complexity per lattice dimension and the frame error rate of the Fano decoder with $b = 1$ and of the SE sphere decoder are shown for different values of SNR in a 20 × 20 16-QAM V-BLAST system (i.e., $m = 40$). In this setting, the Fano decoder offers a reduction in complexity of up to a factor of 100. Moreover, the performance of the Fano decoder is seen to be only a fraction of a dB away from that of the SE decoder, which achieves ML performance. We also see that the frame error rate curves for both the Fano decoder and the SE (ML) decoder have the same slope in the high SNR region, as expected from our analysis. Fig. 4 compares the complexity and performance of the Fano decoder with ZF-DFE and MMSE-DFE based preprocessing, respectively, in a 30 × 30 4-QAM V-BLAST system (i.e., $m = 60$). From the figures, we see that the MMSE-DFE based preprocessing plays a crucial role in lowering the search complexity of the Fano decoder, despite the apparent increase in search space due to lattice decoding. Fig. 5 reports the dependence of the complexity of the Fano decoder on the value of $b$. The complexity attains a local minimum for some $b^* > 1$, and for large values of $b$, the complexity of the Fano decoder decreases as $b$ is increased. The error rate, however, increases monotonically with $b$ and approaches that of the MMSE-DFE Babai decoder as $b \to \infty$. For small dimensions, the performance of the MMSE-DFE based Babai decoder is remarkable. This DFS decoder terminates after finding the first leaf node. Fig. 6 compares the performance of this decoder with the ML performance for a 4 × 4, 4-QAM V-BLAST system. We also report the performance of the Yao-Wornell and Windpassinger-Fischer (YWWF) decoder [?], [?], which has the same complexity as the MMSE-DFE Babai decoder. The performance of the proposed decoder is within a fraction of a dB from that of the ML decoder, whereas the algorithm in [?], [?] exhibits a loss of more than 3 dB.

5.2 Coded MIMO Systems 

In this section, we consider two classes of space-time codes. The first class is the linear dispersion (LD) codes, which are obtained by applying a linear transformation (over $\mathbb{C}$) to a vector of PAM symbols. For convenience, we follow the set-up of Dayal and Varanasi [?], where two variants of the threaded algebraic space-time (TAST) constellations [?] are used in a 3 × 1 MIMO channel. This setup also allows for demonstrating the efficiency of the MMSE-DFE frontend in solving under-determined systems. In [?], the rate-1 TAST constellation uses 64-QAM inputs at a rate of one symbol per channel use. The rate-3 TAST constellation, on the other hand, uses 4-QAM inputs to obtain the same throughput as the rate-1 constellation. As observed in [?], one obtains a sizable performance gain when using the rate-3 TAST constellation under ML decoding. The main disadvantage of the rate-3 code, however, is that it corresponds to an under-determined system with 6 excess unknowns, which significantly complicates the decoding problem. Fig. 7 shows that the performance of the proposed MMSE-DFE lattice decoder is less than 0.1 dB away from the ML decoder in both cases. In order to quantify the complexity reduction offered by our approach, compared with the generalized sphere decoder (GSD) used in [?], we measure the increase in average complexity caused by the excess dimensions. If we define

$$\gamma \triangleq \frac{\text{Average complexity of decoding rate-3 constellation}}{\text{Average complexity of decoding rate-1 constellation}},$$

then a straightforward implementation of the GSD, as outlined in [?] for example, would result in $\gamma = O(4^6)$. In fact, even with the modification proposed in [?], Dayal and Varanasi could only bring this number down to $\gamma = 460$ at an SNR of 30 dB. In Table [?], we report $\gamma$ for the proposed algorithm at different SNRs, where one can see the significant reduction in complexity (i.e., from 460 to 12 at an SNR of 30 dB). Based on experimental observations, we also expect this gain in complexity reduction to increase with the excess dimension $m - n$.

The second space-time coding class is the algebraic codes proposed in [?]. This approach constructs linear codes over the appropriate finite domain, and then the encoded symbols are mapped into QAM constellations. The QAM symbols are then parsed and appropriately distributed across the transmit antennas to obtain full diversity. It has been shown that the complexity of ML decoding of this class of codes grows exponentially with the number of transmit antennas and data rates. Here, we show that the proposed tree search framework allows for an efficient solution to this problem. Figure 8 shows the performance of MMSE-DFE lattice decoding for two such constructions of space-time codes, i.e., the Golay space-time code for two transmit antennas and the companion matrix code for three transmit antennas [?]. In both cases, the performance of the MMSE-DFE lattice decoder is seen to be essentially the same as the ML performance. In the proposed decoder, we use the lattice $\Lambda$ obtained from the underlying algebraic code through construction A. The ML performance in Figure 8 was obtained via exhaustive search, which is not feasible for higher dimensions because its complexity grows exponentially with the number of dimensions.

5.3 Coded Transmission over ISI Channels 

In this section, we compare the performance of the MMSE-Fano decoder with the Per-Survivor Processing (PSP) algorithm for convolutionally coded transmission over ISI channels. Our MMSE-Fano decoder uses the construction A lattice obtained from the convolutional code. For this scenario, it is known that PSP achieves near-ML frame error rate performance [7]. Figure 9 compares the frame and bit error rates for a 4-state, rate-1/2 convolutional code with generator polynomials (5, 7) and code length 200, over a 5-tap ISI channel. The channel impulse response was chosen as (0.848, -0.424, 0.2545, -0.1696, 0.0848). The Fano decoder with $b = 1$ and step size 1 is seen to achieve essentially the same performance as the PSP algorithm for this code, with reasonable search complexity over the entire SNR range. We again note that the loss incurred by lattice decoding, as opposed to searching over the finite code space, is negligible, due to MMSE-DFE preprocessing of the channel prior to the search. Moreover, the complexity of the PSP algorithm, although linear in frame length, increases exponentially with the constraint length of the convolutional code used, while that of the Fano decoder is essentially independent of the constraint length. Figure 9 also shows the performance of the Fano decoder for a rate-1/2, 1024-state convolutional code with generator polynomials (4672, 7542), with the same frame size. Due to the increased constraint length, the performance is significantly better (with almost no increase in complexity). The complexity of the PSP algorithm, on the other hand, is significantly higher for this code.
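The transmit side of this experiment can be sketched as follows: the rate-1/2, 4-state (5, 7) encoder and the 5-tap channel given above. The octal-to-tap convention (most significant bit on the current input), the BPSK mapping $2c - 1$ of the code bits, and the function names are assumptions made only for illustration, and noise is omitted.

```python
G1, G2 = (1, 0, 1), (1, 1, 1)   # octal 5 = 101b, 7 = 111b; taps on (u_t, u_{t-1}, u_{t-2})
H_ISI = (0.848, -0.424, 0.2545, -0.1696, 0.0848)


def conv_encode_57(bits):
    """Rate-1/2 feedforward encoder with generators (5, 7), starting in the zero state."""
    s1 = s2 = 0                   # the two memory cells: u_{t-1}, u_{t-2}
    out = []
    for u in bits:
        reg = (u, s1, s2)
        out.append(sum(g * r for g, r in zip(G1, reg)) % 2)
        out.append(sum(g * r for g, r in zip(G2, reg)) % 2)
        s1, s2 = u, s1
    return out


def isi_channel(symbols, h=H_ISI):
    """Noise-free FIR channel output: the full linear convolution of symbols with h."""
    n = len(symbols) + len(h) - 1
    return [sum(h[j] * symbols[t - j] for j in range(len(h)) if 0 <= t - j < len(symbols))
            for t in range(n)]


# Example: encode, map bits to +/-1 (assumed BPSK mapping 2c - 1), pass through the channel.
code_bits = conv_encode_57([1, 0, 1, 1, 0, 0])
received = isi_channel([2 * c - 1 for c in code_bits])
```

The encoder has only 4 states, but the 5-tap channel alone adds memory of 4 symbols, which is what drives the PSP trellis size; the tree search sees both as one linear system.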

6 Conclusions 

A central goal of this paper was to introduce a unified framework for tree search decoding in wireless 
communication applications. Towards this end, we identified the roles of two different, but interrelated, components of the decoder, namely, 1) Preprocessing and 2) Tree Search. We presented a
preprocessing stage composed of MMSE-DFE filtering for left preprocessing and lattice reduction 
with column ordering for right preprocessing. We argued that this preprocessor allows for ignoring 
the boundary control in the tree search stage while entailing only a marginal loss in performance. 
By relaxing the boundary control, we were able to build a generic framework for designing tree 
search strategies for joint detection and decoding. Within this framework, BeFS emerged as a 
very efficient solution that offers many valuable advantages. To limit the storage requirement 
of BeFS, we re-discovered the Fano decoder as our proposed tree search algorithm. Finally, we 
established the superior performance-complexity tradeoff of the Fano decoder analytically in a V-BLAST configuration and demonstrated its excellent performance and complexity in more general
scenarios via simulation results. 

A The Fano Decoder 

In this section, we obtain the cost function used in the proposed Fano/Stack decoder from the Fano 
metric defined for tree codes over general point-to-point channels, and give a brief description of 
the Fano decoder and its properties. 

A.1 Generic Cost Function of the Fano Decoder

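For the transmitted sequence $\mathbf{x}$, let

$$\mathbf{y} = \mathbf{R}\mathbf{x} + \mathbf{w} \qquad (20)$$

be the system model, as in Section [?]. In (20), the noise sequence $\mathbf{w}$ is composed of i.i.d. Gaussian noise components with zero mean and unit variance.

For a general point-to-point channel with continuous output, the Fano metric $\mu(\mathbf{x}_1^k)$ of the node $\mathbf{x}_1^k$ can be written as [37]

(21)

where $\mathcal{H}(\mathbf{x}_1^k)$ is the hypothesis that $\mathbf{x}_1^k$ form the first $k$ symbols of the transmitted sequence.

For $1 \leq k \leq m$, if $\Pr(\mathcal{H}(\mathbf{x}_1^k))$ is uniform over all nodes $\mathbf{x}_1^k$ that consist of the first $k$ components of any valid codeword in $\mathcal{C}$, then from (21) the cost function for the Fano decoder for our system model (20) can be simplified, up to an additive constant, as

$$f(\mathbf{x}_1^k) = -\mu(\mathbf{x}_1^k) = \log \sum_{\mathbf{x}'^{k}_1} e^{-\frac{1}{2}\sum_{i=1}^{k} w_i(\mathbf{x}'^{i}_1)} + \frac{1}{2}\sum_{i=1}^{k} w_i(\mathbf{x}_1^i) \qquad (22)$$

where the sum runs over all valid nodes $\mathbf{x}'^{k}_1$ at level $k$.

Since the summation over $\mathbf{x}'^{k}_1$ in (22) is not feasible, we use the following approximations. First, $\log(\sum_i a_i) \approx \log(\max_i a_i)$, so the sum can be approximated by its largest term. Second, for moderate to high SNRs, the transmitted sequence is actually the closest vector with high probability, i.e., the largest term corresponds to the transmitted sequence, whose residual is the noise $\mathbf{w}_1^k$. Thus, (22) can be approximated as

$$f(\mathbf{x}_1^k) \approx \log\left(e^{-\frac{1}{2}|\mathbf{w}_1^k|^2}\right) + \frac{1}{2}\sum_{i=1}^{k} w_i(\mathbf{x}_1^i) \qquad (23)$$

After averaging (23) over the noise samples (so that $|\mathbf{w}_1^k|^2$ is replaced by its mean $k$) and scaling by 2, we have

$$f(\mathbf{x}_1^k) = \sum_{j=1}^{k} w_j(\mathbf{x}_1^j) - k.$$

In general, the cost function for the Fano/Stack decoder can be written in terms of the parameter $b$, the bias, as

$$f(\mathbf{x}_1^k) = \sum_{j=1}^{k} w_j(\mathbf{x}_1^j) - bk.$$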

A.2 The Algorithm

The operation of the Fano decoder with no boundary control (lattice decoding) consists of the following steps:

• Step 1: (Initialize) Set $k \leftarrow 0$, $T \leftarrow 0$, $\mathbf{x}_1^k \leftarrow \mathbf{x}_1^0$ (the root node).

• Step 2: (Look forward) $\mathbf{x}_1^{k+1} \leftarrow (\mathbf{x}_1^k, x_{k+1})$, where $x_{k+1}$ is the $(k+1)$-th component of the best child node of $\mathbf{x}_1^k$.

• Step 3:

If $f(\mathbf{x}_1^{k+1}) \leq T$,
    If $k + 1 = m$ (leaf node), then $\hat{\mathbf{x}} = \mathbf{x}_1^m$; exit.
    Else (move forward), $k \leftarrow k + 1$.
        If $f(\mathbf{x}_1^{k-1}) > T - \Delta$,
            while $f(\mathbf{x}_1^k) \leq T - \Delta$, $T \leftarrow T - \Delta$ (tighten threshold).
        Go to Step 2.
Else
    If ($k = 0$ or $f(\mathbf{x}_1^{k-1}) > T$), $T \leftarrow T + \Delta$ (cannot move back, so relax threshold).
        Go to Step 2.
    Else (move back and look forward to the next best node)
        $\mathbf{x}_1^k \leftarrow (\mathbf{x}_1^{k-1}, x'_k)$, where $x'_k$ is the last component of the next best child node of $\mathbf{x}_1^{k-1}$.
        $k \leftarrow k - 1$.
        Go to Step 3. □
Note that $T$ (i.e., the threshold) is allowed to take values only in multiples of the step size $\Delta$ (i.e., $0, \pm\Delta, \pm 2\Delta, \ldots$). When a node is visited by the Fano decoder for the first time, the threshold $T$ is tightened to the least possible value while maintaining the validity of the node. If the current node does not have a valid child node, then the decoder moves back to the parent node (if the parent node is valid) and attempts to move forward to the next best node. However, if the parent node is not valid, the threshold is relaxed and another attempt is made to move forward, proceeding in this way until a leaf node is reached.

The determination of the best and next best child nodes is simplified in the CLPS problem: the child node generation order gen of SE enumeration (Section 4.1.2) generates child nodes with cost functions in ascending order, given any node $\mathbf{x}_1^k$.
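The steps of Appendix A.2 can be sketched in Python as below. The sketch assumes the hypothetical lower-triangular model $w_k = (y_k - \sum_{j \le k} L_{k,j} x_j)^2$, the cost $f(\mathbf{x}_1^k) = \sum_j w_j - bk$ of Appendix A.1, and the SE order for best and next-best children. Only the current path and the SE rank of each chosen child are stored, which reflects the essentially memoryless operation of the Fano decoder. `fano_decode` and its `max_moves` safety guard are our own names.

```python
def fano_decode(L, y, b=1.0, delta=1.0, max_moves=1_000_000):
    """Fano decoder for lattice decoding (no boundary control), following A.2."""
    m = len(y)

    def look(prefix, rank):
        """The rank-th child of `prefix` in SE order and its branch metric w."""
        k = len(prefix)
        s = y[k] - sum(L[k][j] * prefix[j] for j in range(k))
        c = s / L[k][k]
        a = round(c)
        d = 1 if c >= a else -1
        t = (rank + 1) // 2               # ranks 0, 1, 2, 3, ... -> a, a+d, a-d, a+2d, ...
        x = a + t * d if rank % 2 == 1 else a - t * d
        return x, (s - L[k][k] * x) ** 2

    path, ranks, cost = [], [], [0.0]     # cost[k] = f(x_1^k); the root has f = 0
    T, k, rank = 0.0, 0, 0                # Step 1
    for _ in range(max_moves):
        x, w = look(path, rank)           # Step 2 (rank 0) or next-best look-forward
        f_next = cost[k] + w - b
        if f_next <= T:
            if k + 1 == m:
                return path + [x], T      # leaf node: decoding ends
            path.append(x)                # move forward
            ranks.append(rank)
            cost.append(f_next)
            k += 1
            if cost[k - 1] > T - delta:   # first visit: tighten the threshold
                while cost[k] <= T - delta:
                    T -= delta
            rank = 0
        elif k == 0 or cost[k - 1] > T:
            T += delta                    # cannot move back: relax the threshold
            rank = 0
        else:
            rank = ranks.pop() + 1        # move back, then look at the next-best child
            path.pop()
            cost.pop()
            k -= 1
    raise RuntimeError("max_moves exhausted before reaching a leaf")
```

Compared with the stack sketch, nothing is kept for abandoned branches; when the decoder backs up, it recomputes the sibling from its SE rank, trading repeated node visits for memory.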

A.3 Properties of the Fano Decoder

The main properties of the Fano decoder used in our analysis are [?]:

1. A node is generated by the Fano decoder only if its cost function is not greater than the bound $T$.

2. Let the correct path be defined as the path corresponding to the transmitted codeword, and let $f_M$ be the maximum cost function along the correct path. The bound $T$ is always less than $(f_M + \Delta)$, where $\Delta$ is the step size of the Fano decoder; that is, $\max\{T\} < T_M = f_M + \Delta$.

All nodes that are generated by the Fano decoder are necessarily those with cost function not greater than the bound $T$, by Property (1). However, even though the cost function of some node may be smaller than the bound, the node itself might not be visited when the bound takes the value $T$: if any of the cost functions along the path $\{\mathbf{x}_1^r, r < k\}$ increases above $T$, the node $\mathbf{x}_1^k$ is not generated, and thus not visited. Hence, $f(\mathbf{x}_1^k) \leq T$ is not a sufficient condition for a node to be generated.

Moreover, in Property (2), the bound $T$ is always less than $(f'_M + \Delta)$, where $f'_M$ is the maximum cost function along any path of length $m$. A tight bound is obtained only when $f'_M$ is taken along the path with the least maximum cost function. However, $f_M$ along the transmitted path is usually easier to characterize statistically than $f'_M$.



B Properties of the Stack Decoder 

For any node $\mathbf{x}_1^k$ in the tree, let $h(\mathbf{x}_1^k) = \sum_{i=1}^{k} w_i(\mathbf{x}_1^i) - bk$. For the stack algorithm, the cost function of any node in ACTIVE at any instant is defined as follows: if $\mathbf{x}_1^k$ is a leaf node, then $f(\mathbf{x}_1^k) = -\infty$. Otherwise, we let $\mathbf{x}_1^{k+1}$ be the best child node of $\mathbf{x}_1^k$ not generated yet, and define $f(\mathbf{x}_1^k) = h(\mathbf{x}_1^{k+1})$. We note that $h$ of any node, once generated, remains constant throughout the algorithm, and $f$ of any node is non-decreasing as the algorithm progresses.

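Proposition 1 Let $\hat{\mathbf{x}}_1^m = (\hat{x}_1, \ldots, \hat{x}_m)$ be the path chosen by the stack algorithm, and $\mathbf{x}_1^m = (x_1, \ldots, x_m)$ be any path in the tree. Then,

$$\max_{1 \leq j \leq m} h(\hat{\mathbf{x}}_1^j) \leq \max_{1 \leq j \leq m} h(\mathbf{x}_1^j) \qquad (24)$$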
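Proof: On the contrary, assume there exists a path $(\hat{x}_1, \ldots, \hat{x}_d, x_{d+1}, \ldots, x_m)$ that does not satisfy (24). Here, the path is assumed to share the same nodes with the chosen path until level $d$, and diverges from the chosen path from level $d+1$ onwards. Since this path does not satisfy (24), and both paths have the same values of $h$ up to level $d$,

$$\max_{d+1 \leq j \leq m} h(\hat{\mathbf{x}}_1^j) > \max_{d+1 \leq j \leq m} h(\mathbf{x}_1^j) \qquad (25)$$

Let $\hat{\mathbf{x}}_1^k$, $k > d$, be the node at which $\max_{d+1 \leq j \leq m} h(\hat{\mathbf{x}}_1^j)$ occurs. Then, we have

$$h(\hat{\mathbf{x}}_1^k) > h(\mathbf{x}_1^j), \quad d < j \leq m \qquad (26)$$

Since $\hat{\mathbf{x}}_1^m$ is the chosen path, the node $\hat{\mathbf{x}}_1^k$ is generated at some instant before the search terminates. Just before $\hat{\mathbf{x}}_1^k$ is generated, $f(\hat{\mathbf{x}}_1^{k-1}) = h(\hat{\mathbf{x}}_1^k)$, since $\hat{\mathbf{x}}_1^k$ is the best child node of $\hat{\mathbf{x}}_1^{k-1}$ not generated yet. Moreover, since $h(\hat{\mathbf{x}}_1^k) > h(\mathbf{x}_1^{d+1})$, the node $\mathbf{x}_1^d$ with cost function $f(\mathbf{x}_1^d) = h(\mathbf{x}_1^{d+1})$ appears at the top of the stack at some instant before $\hat{\mathbf{x}}_1^k$ is generated. Therefore, $\mathbf{x}_1^{d+1}$ is generated before $\hat{\mathbf{x}}_1^k$ is generated. Since the search does not terminate before $\hat{\mathbf{x}}_1^k$ is generated, applying the same argument, one sees that all the nodes $\mathbf{x}_1^{d+2}, \ldots, \mathbf{x}_1^m$ are generated before $\hat{\mathbf{x}}_1^k$ is generated. However, once $\mathbf{x}_1^m$ is generated by the stack algorithm, the search terminates with $(\hat{x}_1, \ldots, \hat{x}_d, x_{d+1}, \ldots, x_m)$ as the chosen path, which contradicts the assumption that $\hat{\mathbf{x}}_1^m$ is the chosen path. Since $1 \leq d \leq m$ can take any value, the inequality in (24) is satisfied by all paths.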



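Proposition 2 If

$$\max_{1 \leq j \leq d} h(\mathbf{x}_1^j) > \max_{1 \leq j \leq m} h(\hat{\mathbf{x}}_1^j) \qquad (27)$$

then the node $\mathbf{x}_1^d$ is not generated.

Proof: First, we show that if

$$h(\mathbf{x}_1^d) > \max_{1 \leq j \leq m} h(\hat{\mathbf{x}}_1^j) \qquad (28)$$

then $\mathbf{x}_1^d$ is not generated. Let (28) be true, and assume $\mathbf{x}_1^d$ is generated. Then, just before $\mathbf{x}_1^d$ is generated, its parent node $\mathbf{x}_1^{d-1}$ is at the top of ACTIVE, with cost function $f(\mathbf{x}_1^{d-1}) = h(\mathbf{x}_1^d)$. However, since $h(\mathbf{x}_1^d) > h(\hat{\mathbf{x}}_1^j)$, $1 \leq j \leq m$, all nodes along the chosen path are generated before $\mathbf{x}_1^d$ is generated, and hence the search terminates before $\mathbf{x}_1^d$ is generated. Noting that $\mathbf{x}_1^d$ can be generated only if all the nodes $\mathbf{x}_1^1, \ldots, \mathbf{x}_1^{d-1}$ are generated, and applying the same argument to $\mathbf{x}_1^1, \ldots, \mathbf{x}_1^{d-1}$, we obtain the claim under condition (27).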



C Proof of Theorem 2

Let $\mathcal{A}_{IR}$ be the set of nodes generated by the IR algorithm, where the bounding function $\mathbf{t}$ has components given by $t_k = bk + \delta$. Let $\mathcal{A}_S$ be the set of nodes generated by the stack decoder with the bias $b$. The IR algorithm in Theorem 2 can be defined with the bounding function given by $\{t_k = bk + \delta, 1 \leq k \leq m\}$ and the cost function for any node $\mathbf{x}_1^k$ given by $\sum_{i=1}^{k} w_i(\mathbf{x}_1^i)$, or equivalently, with the bounding function $t_k = \delta$ and the cost function $\left(\sum_{i=1}^{k} w_i(\mathbf{x}_1^i) - bk\right)$. If $\delta$ is the bound of the IR algorithm, then any node $\mathbf{x}_1^k$ is generated by the algorithm if and only if all the conditions $w_1(\mathbf{x}_1^1) - b \leq \delta$, $\sum_{i=1}^{2} w_i(\mathbf{x}_1^i) - 2b \leq \delta$, $\ldots$, $\sum_{i=1}^{k} w_i(\mathbf{x}_1^i) - bk \leq \delta$ are satisfied. Therefore,

$$\mathcal{A}_{IR} = \left\{\mathbf{x}_1^k : \max_{1 \leq j \leq k}\left(\sum_{i=1}^{j} w_i(\mathbf{x}_1^i) - bj\right) \leq \delta\right\} \qquad (29)$$

Moreover, 5 should be such that at least one sequence x e U is included within the search space. 7 
Let xj,R be a leaf node such that 

x/p = argmin ( max (wAjd) — bk) ) . (30) 
xew yi<fc<m v y y 

ie. ; x/ft has the least value of maximum cost function among all paths of length m. If 

5 < max (wi(xjRi) — bk) , 

l<j<m ' 

then noxeW lies within the search space, and the search space is empty. If lattice decoding is used, 
then the minimum in (j3Ti|) is taken over all x G Z m . Therefore, 5 > maxi<j< m (wi(xj R1 ) — bk). 
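By Proposition 1, the path chosen by the stack algorithm, $\hat{\mathbf{x}}_1^m$, satisfies

$$\max_{1 \le k \le m}\left(\sum_{i=1}^{k} w_i(\hat{\mathbf{x}}_1^i) - bk\right) \le \max_{1 \le k \le m}\left(\sum_{i=1}^{k} w_i(\tilde{\mathbf{x}}_1^i) - bk\right), \qquad (31)$$

where $\tilde{\mathbf{x}}_1^m$ is any other path. From (30) and (31),

$$\max_{1 \le k \le m}\left(\sum_{i=1}^{k} w_i(\hat{\mathbf{x}}_1^i) - bk\right) = \max_{1 \le k \le m}\left(\sum_{i=1}^{k} w_i(\mathbf{x}_{IR,1}^i) - bk\right) \le \delta. \qquad (32)$$

From Proposition 2 and (32), $\mathcal{A}_s \subseteq \mathcal{A}_{IR}$.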
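The containment $\mathcal{A}_s \subseteq \mathcal{A}_{IR}$ can be checked numerically on a small tree. The Python sketch below is only an illustration under stated assumptions: the branch metrics are hypothetical i.i.d. $\chi^2_1$ values attached to the edges of a $Q$-ary tree of depth $m$ (not a channel model), the stack decoder is modeled as a plain best-first search keyed on $h$, and a node is counted as "generated" once it is popped and expanded. The IR bound $\delta$ is set to its smallest feasible value, the min-max cost in (30).

```python
import heapq
import itertools
import random

def path_cost(w, path, b):
    """h(x_1^k) = sum_{j=1}^{k} w_j(x_1^j) - b*k for a node given as a tuple of branch labels."""
    return sum(w[path[:j]] for j in range(1, len(path) + 1)) - b * len(path)

def bottleneck(w, path, b):
    """Largest cost along the path: max_{1<=j<=k} h(x_1^j)."""
    return max(path_cost(w, path[:j], b) for j in range(1, len(path) + 1))

def stack_search(w, Q, m, b):
    """Best-first search keyed on h; returns the first leaf popped and the expanded nodes (depth >= 1)."""
    heap = [(0.0, 0, ())]                  # (cost, tie-breaker, node); the root has cost 0
    tie = itertools.count(1)
    expanded = set()
    while True:
        _, _, node = heapq.heappop(heap)
        if len(node) == m:                 # the first leaf popped is the decision
            return node, expanded
        if node:
            expanded.add(node)
        for s in range(Q):
            child = node + (s,)
            heapq.heappush(heap, (path_cost(w, child, b), next(tie), child))

def random_tree(Q, m, seed=0):
    """Hypothetical branch metrics: one chi-square(1) value per edge of a Q-ary tree of depth m."""
    rng = random.Random(seed)
    return {node: rng.gauss(0.0, 1.0) ** 2
            for k in range(1, m + 1) for node in itertools.product(range(Q), repeat=k)}

Q, m, b = 3, 6, 1.5
w = random_tree(Q, m)
leaf, expanded = stack_search(w, Q, m, b)

# Smallest IR bound that keeps at least one leaf in the search space, cf. (30).
delta = min(bottleneck(w, x, b) for x in itertools.product(range(Q), repeat=m))
# IR search space (29): every node whose running maximum cost does not exceed delta.
A_IR = {node for node in w if bottleneck(w, node, b) <= delta}

print(bottleneck(w, leaf, b) == delta, expanded <= A_IR)
```

The first printed value reflects Proposition 1 (the chosen path attains the min-max cost) and the second the containment of Theorem 2; for this best-first model both are expected to hold for any seed.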



D Proof of Theorem El 

In this section, we derive an upper bound on the frame error rate of a V-BLAST system with uncoded input (with a Q-PAM constellation for each component), for the Fano decoder that visits paths in the regular Q-PAM signal space. The preprocessing assumed here is the QR decomposition of $\mathbf{H}$.

Let $\mathcal{E}_f$ be the event that the Fano decoder makes an erroneous detection, conditioned on $T_M = f_M + \Delta$. Then, $P_e = E_{f_M}(\Pr(\mathcal{E}_f))$ is the frame error rate of the Fano decoder, and in this section we derive an upper bound on $P_e$. By property (2) of the Fano decoder, $T \le f_M + \Delta$, where $\Delta$ is the step size of the Fano decoder. Any sequence $\mathbf{x} \ne \bar{\mathbf{x}}$, where $\bar{\mathbf{x}}$ is the transmitted sequence, can be decoded as the closest point by the Fano decoder only if its cost function is less than $T_M$. One has

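$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{z} = \mathbf{Q}\begin{bmatrix}\mathbf{0}_{r\times m}\\ \mathbf{R}\end{bmatrix}\mathbf{x} + \mathbf{z}, \qquad (33)$$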
⁷ Otherwise, $\delta$ is increased and the search is repeated afresh.
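and therefore

$$\mathbf{Q}^T\mathbf{y} = \begin{bmatrix}\mathbf{0}_{r\times m}\\ \mathbf{R}\end{bmatrix}\mathbf{x} + \mathbf{w} = \begin{bmatrix}\mathbf{w}_1^{r}\\ \mathbf{R}\mathbf{x} + \mathbf{w}_{r+1}^{n}\end{bmatrix}, \qquad \mathbf{w} = \mathbf{Q}^T\mathbf{z}, \qquad (34)$$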



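where $r = n - m$ is the number of excess degrees of freedom in the V-BLAST system. Since the cost function of a leaf node $\mathbf{x}_1^m$ is $f(\mathbf{x}_1^m) = \sum_{i=1}^{m} w_i(\mathbf{x}_1^i) - bm = |\mathbf{R}\mathbf{x} + \mathbf{w}_{r+1}^n|^2 - bm$, $P(\mathcal{E}_f)$ can be upper bounded as

$$P(\mathcal{E}_f) \le \sum_{\mathbf{x} \in \mathcal{U},\, \mathbf{x} \ne \bar{\mathbf{x}}} \Pr\left(\sum_{i=1}^{m} w_i(\mathbf{x}_1^i) - bm < T_M\right) \qquad (35)$$

$$= \sum_{\mathbf{x} \in \mathcal{U},\, \mathbf{x} \ne \bar{\mathbf{x}}} \Pr\left(|\mathbf{R}\mathbf{x} + \mathbf{w}_{r+1}^n|^2 < bm + f_M + \Delta\right), \qquad (36)$$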



where

$$f_M = \max\left\{0,\ |\mathbf{w}_{r+1}^{r+1}|^2 - b,\ |\mathbf{w}_{r+1}^{r+2}|^2 - 2b,\ \ldots,\ |\mathbf{w}_{r+1}^{n}|^2 - mb\right\}$$

is the maximum cost function along the path of the transmitted sequence. The upper bound in (35) follows from the union bound and from the fact that, in general, $f(\mathbf{x}_1^m) < T_M$ is only a necessary condition for $\mathbf{x}_1^m$ to be decoded by the Fano decoder.
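The bound in (36) can be rewritten as

$$P(\mathcal{E}_f) \le \sum_{\mathbf{x} \in \mathcal{U},\, \mathbf{x} \ne \bar{\mathbf{x}}} \Pr\left(\left|\begin{bmatrix}\mathbf{0}_{r\times m}\\ \mathbf{R}\end{bmatrix}\mathbf{x} + \mathbf{w}\right|^2 < bm + f_M + \Delta + |\mathbf{w}_1^r|^2\right) \qquad (37)$$

$$= \sum_{\mathbf{x} \in \mathcal{U},\, \mathbf{x} \ne \bar{\mathbf{x}}} \Pr\left(\left|\begin{bmatrix}\mathbf{0}_{r\times m}\\ \mathbf{R}\end{bmatrix}\mathbf{x} + \mathbf{w}\right|^2 - |\mathbf{w}|^2 < bm + f_M + \Delta - |\mathbf{w}_{r+1}^n|^2\right) \qquad (38)$$

$$= \sum_{\mathbf{x} \in \mathcal{U},\, \mathbf{x} \ne \bar{\mathbf{x}}} \Pr\left(|\mathbf{H}\mathbf{x}|^2 + 2(\mathbf{H}\mathbf{x})^T\mathbf{z} < bm + f_M + \Delta - |\mathbf{w}_{r+1}^n|^2\right) \qquad (39)$$

$$\le \sum_{\mathbf{x} \in \mathcal{U},\, \mathbf{x} \ne \bar{\mathbf{x}}} \Pr\left(|\mathbf{H}\mathbf{x}|^2 + 2(\mathbf{H}\mathbf{x})^T\mathbf{z} < bm + \Delta\right), \qquad (40)$$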



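since

$$f_M - |\mathbf{w}_{r+1}^n|^2 = \max\left\{-|\mathbf{w}_{r+1}^n|^2,\ -|\mathbf{w}_{r+2}^n|^2 - b,\ \ldots,\ -mb\right\} \le 0.$$

The bound in (40) is now independent of the value of $f_M$, and hence represents a bound on the frame error rate. Note that the corresponding expression in (40) for ML decoding is $\Pr(|\mathbf{H}\mathbf{x}|^2 + 2(\mathbf{H}\mathbf{x})^T\mathbf{z} < 0)$. For any $\mathbf{x} \in \mathcal{U}$ with $\mathbf{x} \ne \bar{\mathbf{x}}$, let $d^2(\mathbf{x}, \bar{\mathbf{x}}) = |\mathbf{H}\mathbf{x}|^2$ denote the squared Euclidean distance between the lattice points $\mathbf{H}\mathbf{x}$ and $\mathbf{H}\bar{\mathbf{x}}$. Then,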



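$$\Pr\left(|\mathbf{H}\mathbf{x} + \mathbf{z}|^2 - |\mathbf{z}|^2 < mb + \Delta\right) \le \begin{cases} e^{-\frac{\left(d^2(\mathbf{x},\bar{\mathbf{x}}) - mb - \Delta\right)^2}{8\, d^2(\mathbf{x},\bar{\mathbf{x}})}}, & d^2(\mathbf{x},\bar{\mathbf{x}}) > mb + \Delta,\\ 1, & d^2(\mathbf{x},\bar{\mathbf{x}}) \le mb + \Delta, \end{cases} \qquad (41)$$

by the Chernoff bound.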
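The Chernoff step (41), and its relaxation used for $d^2(\mathbf{x},\bar{\mathbf{x}}) > mb + \Delta$, can be sanity-checked numerically. This sketch assumes unit-variance noise (consistent with the branch metrics $w_i \sim \mathcal{N}(0,1)$ used later in this appendix), so that $2(\mathbf{H}\mathbf{x})^T\mathbf{z}$ is Gaussian with variance $4d^2$; the value chosen for $mb + \Delta$ is hypothetical.

```python
import math

def exact_tail(d2, c):
    """Pr(d2 + 2*d*Z < c) with Z ~ N(0,1) and d = sqrt(d2); this equals Pr(|Hx|^2 + 2(Hx)^T z < c)
    when z has i.i.d. unit-variance entries, because 2(Hx)^T z ~ N(0, 4*d2)."""
    t = (c - d2) / (2.0 * math.sqrt(d2))
    return 0.5 * math.erfc(-t / math.sqrt(2.0))    # standard normal CDF evaluated at t

def bound_41(d2, c):
    """Right-hand side of (41)."""
    return math.exp(-((d2 - c) ** 2) / (8.0 * d2)) if d2 > c else 1.0

def bound_43(d2, c):
    """Relaxed form exp(-d2/8) * exp(c/4), only used for d2 > c."""
    return math.exp(-d2 / 8.0) * math.exp(c / 4.0)

c = 2 * 1.5 + 1.0    # hypothetical mb + Delta with m = 2, b = 1.5, Delta = 1
for d2 in (1.0, 4.0, 8.0, 16.0, 32.0, 64.0):
    line = f"d^2={d2:5.1f}  exact={exact_tail(d2, c):.2e}  (41)={bound_41(d2, c):.2e}"
    if d2 > c:
        line += f"  relaxed={bound_43(d2, c):.2e}"
    print(line)
```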

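For $d^2(\mathbf{x},\bar{\mathbf{x}}) > mb + \Delta$, (41) can be rewritten as

$$\Pr\left(|\mathbf{H}\mathbf{x} + \mathbf{z}|^2 - |\mathbf{z}|^2 < mb + \Delta\right) \le e^{-\frac{d^2(\mathbf{x},\bar{\mathbf{x}})}{8}}\, e^{\frac{mb+\Delta}{4}}\, e^{-\frac{(mb+\Delta)^2}{8\, d^2(\mathbf{x},\bar{\mathbf{x}})}} \qquad (42)$$

$$\le e^{-\frac{d^2(\mathbf{x},\bar{\mathbf{x}})}{8}}\, e^{\frac{mb+\Delta}{4}}, \qquad (43)$$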
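since $e^{-(mb+\Delta)^2/(8\, d^2(\mathbf{x},\bar{\mathbf{x}}))} \le 1$. Let

$$q \triangleq \min_{\mathbf{x}_i, \mathbf{x}_j \in \mathcal{U},\ \mathbf{x}_i \ne \mathbf{x}_j} |\mathbf{H}(\mathbf{x}_i - \mathbf{x}_j)|^2 \qquad (44)$$

and let

$$g(q) = \begin{cases} Q^m e^{\frac{mb+\Delta}{4}} e^{-\frac{q}{8}}, & q > mb + \Delta,\\ 1, & q \le mb + \Delta. \end{cases} \qquad (45)$$

Then, from (40), (43), and (44), $P_e \le E_q(g(q))$, since the sum in (40) has fewer than $Q^m$ terms, each with $d^2(\mathbf{x},\bar{\mathbf{x}}) \ge q$. An upper bound on the probability density function (pdf) of $q$ is given by [?]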

$$p(q) \le A\, p_\chi(q), \qquad (46)$$



where $A$ is a finite sum over $k$, and $p_\chi(q)$ is the pdf of a scaled chi-square random variable with $n$ degrees of freedom and mean $n\sigma^2$ (i.e., a random variable that is the sum of squares of $n$ i.i.d. zero-mean Gaussian variables with variance $\sigma^2$). Then, averaging (45) over $q$ and applying (46) gives

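$$P_e \le Q^m e^{\frac{mb+\Delta}{4}} \int_{mb+\Delta}^{\infty} e^{-\frac{q}{8}} p(q)\, dq + \int_{0}^{mb+\Delta} p(q)\, dq \qquad (47)$$

$$\le A Q^m e^{\frac{mb+\Delta}{4}} \int_{mb+\Delta}^{\infty} e^{-\frac{q}{8}} \frac{q^{n/2-1} e^{-q/(2\sigma^2)}}{2^{n/2}\,\Gamma(n/2)\,\sigma^n}\, dq + A\,\gamma\!\left(\frac{mb+\Delta}{2\sigma^2}, \frac{n}{2}\right) \qquad (48)$$

$$\le A Q^m e^{\frac{mb+\Delta}{4}} \int_{0}^{\infty} \frac{q^{n/2-1} e^{-q/8 - q/(2\sigma^2)}}{2^{n/2}\,\Gamma(n/2)\,\sigma^n}\, dq + A\,\gamma\!\left(\frac{mb+\Delta}{2\sigma^2}, \frac{n}{2}\right) \qquad (49)$$

$$= \frac{A Q^m e^{(mb+\Delta)/4}}{\left(1 + \frac{\sigma^2}{4}\right)^{n/2}} + A\,\gamma\!\left(\frac{mb+\Delta}{2\sigma^2}, \frac{n}{2}\right), \qquad (50)$$

where $\sigma^2 = \rho/m$, $A$ is a constant independent of $q$ or $\rho$, and $\gamma(x, a)$ is the incomplete gamma function. If $b$ is bounded for all $\rho$ (i.e., $b \le A_1 < \infty$), then $e^{(mb+\Delta)/4}$ is also bounded for all $\rho$ and finite $m$. The error performance of the Fano decoder can now be characterized by the sum of two terms. For large values of SNR, the first term decays as $\rho^{-n/2}$, and hence it has the same diversity as the ML decoder. The second term can also be bounded as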

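$$\gamma\!\left(\frac{mb+\Delta}{2\sigma^2}, \frac{n}{2}\right) \le \left(1 - e^{-\frac{mb+\Delta}{2\sigma^2}}\right)^{n/2} \qquad (51)$$

$$\le \left(\frac{mb+\Delta}{2\sigma^2}\right)^{n/2} \qquad (52)$$

$$= \left(\frac{m(mb+\Delta)}{2\rho}\right)^{n/2}, \qquad (53)$$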
where (51) follows from the inequality $\gamma(x, a) \le (1 - e^{-x})^a$ (Appendix E.2), and (52) from $1 - e^{-x} \le x$ for $x > 0$. The second term also decays as $\rho^{-n/2}$, and hence the Fano decoder achieves the same diversity as the ML decoder for this system.
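The diversity claim can be illustrated by evaluating the two-term bound (50) over increasing SNR and measuring its log-log slope. This is a sketch under assumptions: the system parameters are hypothetical, $\sigma^2 = \rho/m$ as above, and the constant $A$ in (46) is replaced by the placeholder value 1, which only shifts the curve and does not affect the slope. $\gamma(x, a)$ is computed as the normalized lower incomplete gamma function by its power series.

```python
import math

def gamma_lower_reg(x, a, tol=1e-16):
    """Normalized lower incomplete gamma P(a, x), written gamma(x, a) in the text.
    Power series; adequate for the small-to-moderate x used here."""
    if x <= 0.0:
        return 0.0
    term = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))   # x^a e^{-x} / Gamma(a+1)
    total, j = term, 0
    while term > tol * total:
        j += 1
        term *= x / (a + j)
        total += term
    return min(total, 1.0)

def fano_fer_bound(rho, Q, m, n, b, delta, A=1.0):
    """Two-term bound (50) with sigma^2 = rho/m; A = 1 is a placeholder for the constant in (46)."""
    c = m * b + delta
    sigma2 = rho / m
    first = A * Q ** m * math.exp(c / 4.0) / (1.0 + sigma2 / 4.0) ** (n / 2.0)
    second = A * gamma_lower_reg(c / (2.0 * sigma2), n / 2.0)
    return first + second

Q, m, n, b, delta = 4, 4, 4, 1.5, 1.0    # hypothetical system parameters
rhos = [10.0 ** e for e in range(3, 9)]
bounds = [fano_fer_bound(rho, Q, m, n, b, delta) for rho in rhos]
# Local log-log slope between consecutive SNR points; at high SNR it should approach -n/2.
slopes = [(math.log(b2) - math.log(b1)) / (math.log(r2) - math.log(r1))
          for (r1, b1), (r2, b2) in zip(zip(rhos, bounds), zip(rhos[1:], bounds[1:]))]
print([round(s, 3) for s in slopes])
```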

The above derivation also applies to the stack algorithm, with minor modifications. Let $\mathcal{E}_s$ be the event that the stack algorithm makes an erroneous detection, conditioned on the value of $f_M$. Then, $P_e = E_{f_M}(\Pr(\mathcal{E}_s))$ is the word error rate of the stack algorithm. By Proposition 1, any path $\mathbf{x} \ne \bar{\mathbf{x}}$ is decoded as the closest point by the stack algorithm only if $h(\mathbf{x}) = \sum_{i=1}^{m} w_i(\mathbf{x}_1^i) - bm$ is not greater than $f_M$. Hence, $P(\mathcal{E}_s)$ can be upper bounded as

$$P(\mathcal{E}_s) \le \sum_{\mathbf{x} \in \mathcal{U},\, \mathbf{x} \ne \bar{\mathbf{x}}} \Pr\left(\sum_{j=1}^{m} w_j(\mathbf{x}_1^j) - bm \le f_M\right) \qquad (54)$$

$$= \sum_{\mathbf{x} \in \mathcal{U},\, \mathbf{x} \ne \bar{\mathbf{x}}} \Pr\left(|\mathbf{R}\mathbf{x} + \mathbf{w}_{r+1}^n|^2 \le bm + f_M\right). \qquad (55)$$

From (55) and (36), it is easy to see that the error probability expression for the stack algorithm is the same as that for the Fano decoder with $\Delta = 0$. Thus, the stack algorithm also achieves the same diversity as the ML decoder for a V-BLAST system, for any finite value of $b$. □



E Proof of Theorem |U 

The following results are required for the proof.

E.1 Wald's inequality

Let $S_0 = 0, S_1, S_2, \ldots$ be a random walk with $S_j = \sum_{i=1}^{j} X_i$, where the $X_i$ are i.i.d. random variables such that $\Pr(X_i > 0) > 0$, $\Pr(X_i < 0) > 0$, and $E(X_i) < 0$. Let $g(\lambda) = E(e^{\lambda X_i})$ be the moment generating function of $X_i$, and let $\lambda_0 > 0$ be a root of $g(\lambda) = 1$. Then, from Wald's identity [37],

$$\Pr(S_{\max} \ge u) \le e^{-\lambda_0 u}, \qquad (56)$$

where $S_{\max} = \max_j S_j$.

For the random walk with $X_i = w_i^2 - b$, where $w_i \sim \mathcal{N}(0, 1)$, the above conditions are satisfied if $b > 1$. The moment generating function of $X_i = w_i^2 - b$ is

$$g(\lambda) = \frac{e^{-\lambda b}}{\sqrt{1 - 2\lambda}}, \qquad \lambda < \tfrac{1}{2}. \qquad (57)$$

From (57), $\lambda_0 > 0$ can be found as the positive root of the equation

$$-2\lambda b = \log(1 - 2\lambda).$$

Since $\log(1 - 2\lambda)$ decreases from $0$ to $-\infty$ as $\lambda$ increases from $0$ to $\frac{1}{2}$, $\lambda_0 \in (0, 0.5)$. Since $\max_{0 \le j \le m} S_j \le \max_{j \ge 0} S_j$, the bound in (56) is also valid for any stopped random walk.
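The root $\lambda_0$ and the bound (56) can be computed and checked by simulation. The sketch below solves $-2\lambda b = \log(1 - 2\lambda)$ by bisection after the substitution $t = 1 - 2\lambda$ (a numerical choice made here, not taken from the text), and compares $e^{-\lambda_0 u}$ with a Monte Carlo estimate of the tail of the maximum of a stopped walk; the values of $b$, $u$, the walk length, and the trial count are hypothetical.

```python
import math
import random

def wald_lambda0(b, iters=200):
    """Positive root lambda_0 of g(lambda) = e^{-lambda b} / sqrt(1 - 2 lambda) = 1, i.e. of
    -2 lambda b = log(1 - 2 lambda).  With t = 1 - 2 lambda the equation is
    F(t) = log(t) + b (1 - t) = 0, where F < 0 near t = 0 and F > 0 just below t = 1 when b > 1."""
    if b <= 1.0:
        raise ValueError("a positive root exists only for b > 1")
    F = lambda t: math.log(t) + b * (1.0 - t)
    lo, hi = 1e-300, 1.0 - 1e-9
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (1.0 - 0.5 * (lo + hi))

def walk_exceeds(b, u, steps, trials, seed=1):
    """Monte Carlo estimate of Pr(max_{0<=j<=steps} S_j >= u) for S_j = sum_{i<=j} (w_i^2 - b)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = 0.0
        for _ in range(steps):
            s += rng.gauss(0.0, 1.0) ** 2 - b
            if s >= u:
                hits += 1
                break
    return hits / trials

b, u = 2.0, 3.0
lam0 = wald_lambda0(b)
p_hat = walk_exceeds(b, u, steps=40, trials=10000)
print(f"lambda_0 = {lam0:.4f}, empirical tail = {p_hat:.4f}, bound (56) = {math.exp(-lam0 * u):.4f}")
```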
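E.2 Upper bounds on γ

For a scaled chi-square random variable $X$ with $k$ degrees of freedom and mean $k\sigma^2$,

$$\Pr(X < \beta) = \gamma\!\left(\frac{\beta}{2\sigma^2}, \frac{k}{2}\right),$$

where $\gamma$ is known as the (normalized) incomplete gamma function. From the Chernoff bound, we have

$$\gamma\!\left(\frac{\beta}{2\sigma^2}, \frac{k}{2}\right) = \Pr(-X > -\beta) \le \begin{cases} \left(\frac{\beta}{k\sigma^2}\right)^{k/2} e^{\frac{k}{2} - \frac{\beta}{2\sigma^2}}, & \frac{\beta}{2\sigma^2} < \frac{k}{2},\\ 1, & \frac{\beta}{2\sigma^2} \ge \frac{k}{2}. \end{cases} \qquad (58)$$

A simpler, though looser, upper bound, valid for $a \ge 1$, is given in [?]:

$$\gamma(x, a) \le \left(1 - e^{-x}\right)^a. \qquad (59)$$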
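Both bounds can be compared against the exact function numerically. In the sketch below, $\gamma(x, a)$ is the normalized lower incomplete gamma function computed by its power series, (58) is rewritten with $x = \beta/(2\sigma^2)$ and $a = k/2$ as $(x/a)^a e^{a-x}$ for $x < a$, and only $a \ge 1$ is tested for (59). The grid of $k$ and $x$ values is an arbitrary choice for illustration.

```python
import math

def gamma_lower_reg(x, a, tol=1e-16):
    """Normalized lower incomplete gamma gamma(x, a) = P(a, x) via its power series.
    No underflow protection: intended for moderate x (say x < 100)."""
    if x <= 0.0:
        return 0.0
    term = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))   # x^a e^{-x} / Gamma(a+1)
    total, j = term, 0
    while term > tol * total:
        j += 1
        term *= x / (a + j)
        total += term
    return min(total, 1.0)

def chernoff_58(x, a):
    """Bound (58) with x = beta/(2 sigma^2) and a = k/2: (x/a)^a e^{a-x} if x < a, else 1."""
    return (x / a) ** a * math.exp(a - x) if x < a else 1.0

def simple_59(x, a):
    """Bound (59): (1 - e^{-x})^a, used here only for a >= 1."""
    return (1.0 - math.exp(-x)) ** a

for k in (2, 3, 4, 8):                 # degrees of freedom of the chi-square variable
    a = k / 2.0
    for x in (0.1, 0.5, 1.0, 2.0, 4.0):
        print(k, x, round(gamma_lower_reg(x, a), 4),
              round(chernoff_58(x, a), 4), round(simple_59(x, a), 4))
```

For $k = 2$ the bound (59) coincides with $\gamma(x, 1) = 1 - e^{-x}$, while the Chernoff bound (58) is the tighter of the two when $x$ is much smaller than $a$.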



E.3 Proof 



Let $\mathbf{x}_1^k$ be any path in the tree, and let $h(\mathbf{x}_1^k) = \sum_{i=1}^{k} w_i(\mathbf{x}_1^i) - bk$ denote its cost function. Let $f_M = \max_{0 \le j \le m} h(\bar{\mathbf{x}}_1^j)$, with $h(\bar{\mathbf{x}}_1^0) = 0$, be the maximum cost function along the transmitted path $\bar{\mathbf{x}}$. By Proposition 1, $f_M$ is not less than the maximum of the cost functions along the path chosen by the stack decoder. From Proposition 2, it is easy to see that a node $\mathbf{x}_1^k$ is generated only if the maximum of the cost functions along the path $\mathbf{x}_1^k$ does not exceed $f_M$.

Let $\mathcal{A}_{s,b}$ be the set of generated nodes. In the proof, we upper bound the number of paths visited by the algorithm that differ from the correct path, and then add the complexity of finding the correct path (i.e., $m$). Then, $\mathcal{A}_{s,b}$ is a subset of the set of nodes that satisfy $h(\mathbf{x}_1^k) \le f_M$.
Let $\mathbf{R}_{k,k}$ be the lower $k \times k$ part of the matrix $\mathbf{R}$.



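Then, we have

$$P(\mathbf{x}_1^k \in \mathcal{A}_{s,b}) \le P\left(|\mathbf{R}_{k,k}\mathbf{x}_1^k + \mathbf{w}_{r+1}^{r+k}|^2 - bk \le f_M\right) \qquad (60)$$

$$= P\left(\left|\begin{bmatrix}\mathbf{0}_{r\times k}\\ \mathbf{R}_{k,k}\end{bmatrix}\mathbf{x}_1^k + \mathbf{w}_1^{r+k}\right|^2 - bk \le f_M + |\mathbf{w}_1^r|^2\right) \qquad (61)$$

$$= P\left(\left|\begin{bmatrix}\mathbf{0}_{r\times k}\\ \mathbf{R}_{k,k}\end{bmatrix}\mathbf{x}_1^k\right|^2 + 2(\mathbf{w}_1^{r+k})^T\begin{bmatrix}\mathbf{0}_{r\times k}\\ \mathbf{R}_{k,k}\end{bmatrix}\mathbf{x}_1^k \le f_M + bk - |\mathbf{w}_{r+1}^{r+k}|^2\right), \qquad (62)$$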



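where $r = n - m$ is the number of excess degrees of freedom in the V-BLAST system. From [33], for each $k \le m$, one can find an $(r+k) \times k$ matrix $\mathbf{H}_{r+k,k}$ that has the same distribution as the lower $(r+k) \times k$ part of $\mathbf{H}$, and an $(r+k) \times (r+k)$ unitary matrix $\boldsymbol{\Theta}_{(r+k)}$ whose distribution is independent of $\mathbf{R}_{k,k}$, such that $\mathbf{H}_{r+k,k} = \boldsymbol{\Theta}_{(r+k)}\begin{bmatrix}\mathbf{0}_{r\times k}\\ \mathbf{R}_{k,k}\end{bmatrix}$.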
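Let $\mathbf{z}_1^{r+k} = \boldsymbol{\Theta}_{(r+k)}\mathbf{w}_1^{r+k}$. The bound in (62) can now be rewritten as

$$P(\mathbf{x}_1^k \in \mathcal{A}_{s,b}) \le P\left(|\mathbf{H}_{r+k,k}\mathbf{x}_1^k|^2 + 2(\mathbf{z}_1^{r+k})^T\mathbf{H}_{r+k,k}\mathbf{x}_1^k \le f_M + bk - |\mathbf{w}_{r+1}^{r+k}|^2\right). \qquad (63)$$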
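In (63), $f_M + bk - |\mathbf{w}_{r+1}^{r+k}|^2$ can be bounded as

$$f_M + bk - |\mathbf{w}_{r+1}^{r+k}|^2 = \max\left\{bk - |\mathbf{w}_{r+1}^{r+k}|^2,\ b(k-1) - |\mathbf{w}_{r+2}^{r+k}|^2,\ \ldots,\ f'_M\right\} \qquad (64)$$

$$\le \max\{bk, f'_M\}, \qquad (65)$$

where $f'_M = \max\left\{0,\ |\mathbf{w}_{r+k+1}^{r+k+1}|^2 - b,\ \ldots,\ |\mathbf{w}_{r+k+1}^{n}|^2 - b(m-k)\right\}$. Let $\beta = \max\{bk, f'_M\}$. Then (63) can be rewritten as

$$P(\mathbf{x}_1^k \in \mathcal{A}_{s,b}) \le P\left(|\mathbf{H}_{r+k,k}\mathbf{x}_1^k|^2 + 2(\mathbf{z}_1^{r+k})^T\mathbf{H}_{r+k,k}\mathbf{x}_1^k \le \beta\right). \qquad (66)$$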
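Using the Chernoff bound, (66) can be written as

$$P(\mathbf{x}_1^k \in \mathcal{A}_{s,b} \mid q_k, \beta) \le \begin{cases} e^{-\frac{(q_k - \beta)^2}{8 q_k}}, & q_k > \beta,\\ 1, & q_k \le \beta, \end{cases} \qquad (67)$$

where $q_k = |\mathbf{H}_{r+k,k}\mathbf{x}_1^k|^2$. Let $\eta$ denote the scaling for which $q_k/\eta$ in (67) is a chi-square random variable with $(r+k)$ degrees of freedom. Since the three random variables $\mathbf{H}_{r+k,k}\mathbf{x}_1^k$, $\mathbf{z}_1^{r+k}$, and $\beta$ are independent, averaging over $q_k$ and $\beta$ gives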



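$$P(\mathbf{x}_1^k \in \mathcal{A}_{s,b}) \le E_\beta\left[\int_0^{\beta} f_{q_k}(q_k)\, dq_k + \int_{\beta}^{\infty} e^{-\frac{(q_k - \beta)^2}{8 q_k}} f_{q_k}(q_k)\, dq_k\right] \qquad (68)$$

$$\le \int_0^{bk} f_{q_k}(q_k)\, dq_k + \int_{bk}^{\infty} e^{-\frac{(q_k - bk)^2}{8 q_k}} f_{q_k}(q_k)\, dq_k + P(\beta > bk), \qquad (69)$$

where $f_{q_k}$ denotes the pdf of $q_k$.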

In (69), $P(\beta > bk) = P(f'_M > bk) \le e^{-bk\lambda_0}$ for $b > 1$ (see Section E.1). Note that $b > 1$ is required because otherwise the distribution of the maximum of the cost functions along the transmitted path depends on $m$.⁸ The bound in (69) amounts to counting all the nodes in the search space when $\beta > bk$. Since $P(\beta > bk)$ decreases sufficiently fast as $k$ increases, this upper bound is still tight enough for our purposes. Now, (69) can be further simplified as



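$$P(\mathbf{x}_1^k \in \mathcal{A}_{s,b}) \le \int_0^{bk} f_{q_k}(q_k)\, dq_k + \int_{bk}^{\infty} e^{-\frac{(q_k - bk)^2}{8 q_k}} f_{q_k}(q_k)\, dq_k + e^{-bk\lambda_0} \qquad (70)$$

$$\le \int_0^{bk} f_{q_k}(q_k)\, dq_k + \int_{bk}^{\infty} e^{-\frac{q_k}{8} + \frac{bk}{4}} f_{q_k}(q_k)\, dq_k + e^{-bk\lambda_0} \qquad (71)$$

$$= \gamma\!\left(\frac{bk}{2\eta}, \frac{r+k}{2}\right) + e^{bk/4}\int_{bk}^{\infty} e^{-q_k/8} f_{q_k}(q_k)\, dq_k + e^{-bk\lambda_0} \qquad (72)$$

$$\le \gamma\!\left(\frac{bk}{2\eta}, \frac{r+k}{2}\right) + \frac{e^{bk/4}}{\left(1 + \frac{\eta}{4}\right)^{\frac{r+k}{2}}} + e^{-bk\lambda_0}, \qquad (73)$$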



⁸ Later, we will require a stronger condition on $b$ to guarantee the convergence of the sums in (78).
with $\gamma(\cdot, \cdot)$ the incomplete gamma function. Assuming $r > 0$, (73) can be bounded as

$$P(\mathbf{x}_1^k \in \mathcal{A}_{s,b}) \le \left(\frac{eb}{\eta}\right)^{\frac{r+k}{2}} + \left(\frac{e^{b/4}}{\sqrt{1 + \eta/4}}\right)^{k} + e^{-bk\lambda_0} \qquad (74)$$

for $\eta > b$. The inequality in (74) follows from an upper bound on the incomplete gamma function (see Section E.2). For a node $\mathbf{x}_1^k$, let $G(\mathbf{x}_1^k) = 1$ if the node is generated and $0$ otherwise. Then, the expected number of nodes generated by the algorithm (i.e., the complexity) is $\sum_{k=1}^{m}\sum_{\mathbf{x}_1^k} E[G(\mathbf{x}_1^k)]$, where the expectation is over all channel realizations. Let $C_m$ be the expected complexity per dimension. Then, assuming a bounded $r$, $C_m$ can be written as⁹

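$$C_m = \frac{1}{m}\sum_{k=1}^{m}\sum_{\mathbf{x}_1^k} E[G(\mathbf{x}_1^k)] \le \frac{1}{m}\left(m + \sum_{k=1}^{m}\sum_{\mathbf{x}_1^k \ne \bar{\mathbf{x}}_1^k} P(\mathbf{x}_1^k \in \mathcal{A}_{s,b})\right). \qquad (75)$$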

The complexity per dimension, $C_m$, can now be upper bounded as




when $b$ and $\eta$ are sufficiently large that all three sums converge. The inequality in (77) holds since the number of nodes at level $k$ is $Q^k$. Since the terms inside the parentheses in (78) are all independent of $m$, the number of nodes visited by the stack algorithm scales at most linearly with $m$ when $\eta > \eta_0$, where $\eta_0$ is the minimum ratio required for convergence of the sums in (78). □
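A small simulation can make the linear-complexity statement concrete. The sketch below is hypothetical in several respects and is not the authors' implementation: it assumes a real-valued model $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{z}$ with unit-variance noise and channel entries of variance $\rho/m$, uses NumPy's reduced QR factorization with an upper-triangular $\mathbf{R}$ and detects the last coordinate first, and counts expanded (popped, non-leaf) nodes rather than generated ones. The node cost is $h = \sum_i w_i - bk$ as in the analysis; all parameter values are illustrative.

```python
import heapq
import numpy as np

def stack_decode(y, H, Q, b):
    """Best-first (stack) search over Q-PAM vectors with per-level bias b.
    Returns the decision and the number of expanded nodes (root included)."""
    m = H.shape[1]
    Qmat, R = np.linalg.qr(H)                 # reduced mode: Qmat is n x m, R is m x m upper-triangular
    yt = Qmat.T @ y
    alphabet = np.arange(-(Q - 1), Q, 2)      # {-(Q-1), ..., -1, 1, ..., Q-1}
    heap = [(0.0, 0.0, 0, ())]                # (h, accumulated metric, tie-breaker, symbols so far)
    tie = 1
    expanded = 0
    while heap:
        h, metric, _, path = heapq.heappop(heap)
        k = len(path)
        if k == m:                            # the first leaf popped ends the search
            return np.array(path[::-1], dtype=float), expanded
        expanded += 1
        row = m - 1 - k                       # index of the symbol decided at level k + 1
        decided = np.array(path[::-1], dtype=float)   # x[row+1], ..., x[m-1]
        resid = yt[row] - R[row, row + 1:] @ decided
        for s in alphabet:
            child_metric = metric + (resid - R[row, row] * s) ** 2
            heapq.heappush(heap, (child_metric - b * (k + 1), child_metric, tie, path + (int(s),)))
            tie += 1
    raise RuntimeError("stack emptied before reaching a leaf")

def simulate(m, n, Q, rho, b, trials, seed=0):
    """Average expanded nodes per dimension and frame error rate over random channels."""
    rng = np.random.default_rng(seed)
    alphabet = np.arange(-(Q - 1), Q, 2)
    nodes = errors = 0
    for _ in range(trials):
        H = rng.standard_normal((n, m)) * np.sqrt(rho / m)   # assumed SNR normalization
        x = rng.choice(alphabet, size=m)
        y = H @ x + rng.standard_normal(n)                    # unit-variance noise
        x_hat, expanded = stack_decode(y, H, Q, b)
        nodes += expanded
        errors += int(np.any(x_hat != x))
    return nodes / (trials * m), errors / trials

for m in (8, 16, 24):
    per_dim, fer = simulate(m, m + 2, Q=4, rho=1e3, b=2.0, trials=50)
    print(f"m = {m:2d}: expanded nodes per dimension = {per_dim:.2f}, frame error rate = {fer:.2f}")
```

If the analysis above applies to this simplified setting, the per-dimension count should stay roughly flat as $m$ grows; the sketch is not intended to reproduce the figures reported in this paper.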

Table 1: Complexity Ratio of the proposed algorithm for Rate-3 TAST constellation over Rate-1 TAST constellation in a 3 x 1 MIMO System

SNR (dB)            22    24    26    28    30
Complexity ratio    41    31    23    16    12



⁹ The first term on the RHS of (75) comes from counting the complexity of finding the correct path, i.e., $\mathbf{x} = \mathbf{0}$.
[Figure: tree diagram with levels labeled Level 1 to Level 4]

Figure 1: Tree representation of the paths searched by sequential decoding algorithms in the case m = 4.
[Figure panels (partial): (c) After MMSE-DFE left preprocessing; (d) Boundary control after right preprocessing]

Figure 2: The effect of left preprocessing on the lattice and of right preprocessing on the information set
Figure 3: Complexity and Performance of SE enumeration and Fano decoder for a 20 x 20 16-QAM V-BLAST system
Figure 4: Complexity and Performance of Fano decoder with ZF-DFE and MMSE-DFE based preprocessing for a 30 x 30 4-QAM V-BLAST system
Figure 5: Complexity and Performance of Fano decoder with different bias, for a 20 x 20 4-QAM V-BLAST system with ZF-DFE preprocessing
Figure 6: Performance of MMSE-DFE preprocessing with DFE for a 4 x 4, 4-QAM V-BLAST system
[Plot: 3 Tx, 1 Rx, 6 bits/channel use, TAST codes. Curves: MMSE-DFE lattice decoding of rate-1 TAST code; MMSE-DFE lattice decoding of rate-3 TAST code; ML decoding of rate-1 TAST code; ML decoding of rate-3 TAST code]

Figure 7: Performance of TAST codes under MMSE-DFE lattice decoding and ML detection with M = 3 and N = 1.
Figure 8: Performance of MMSE-DFE lattice decoding and ML decoding for algebraic space-time codes